Introduction to NLTK – Natural Language Toolkit
Hi Everyone! In this article we will learn about the Natural Language Toolkit, "NLTK". NLTK is one of the key libraries widely used for Natural Language Processing in Python, and it offers many ways of exploring and analysing text.
Let's begin by importing NLTK. Then we will download the required NLTK packages, which include preprocessed data as well as the text of many renowned books.
# Downloading required packages
import nltk
nltk.download('all')
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
When we import book from NLTK, it always prints the listing shown above. To view any one of those texts, we can simply type its key in Python and it will return the corresponding data. In this case the keys are "text1", "text2" and so on.
Let's try calling text2.
<Text: Sense and Sensibility by Jane Austen 1811>
It simply returned the text corresponding to that key.
Searching data in book/text provided by NLTK
To be specific, text2 contains the complete book Sense and Sensibility by Jane Austen 1811, and we can traverse that book to search for words and the sentences around them. For this we will use the concordance() function on text2 and search for a word, let's say "love". We can do it as shown below.
text2.concordance("love")
Displaying 25 of 77 matches:
priety of going , and her own tender love for all her three children determine
es ." " I believe you are right , my love ; it will be better that there shoul
. It implies everything amiable . I love him already ." " I think you will li
sentiment of approbation inferior to love ." " You may esteem him ." " I have n
what it was to separate esteem and love ." Mrs . Dashwood now took pains to
oner did she perceive any symptom of love in his behaviour to Elinor , than sh
how shall we do without her ?" " My love , it will be scarcely a separation .
ise . Edward is very amiable , and I love him tenderly . But yet -- he is not
ll never see a man whom I can really love . I require so much ! He must have a
ry possible charm ." " Remember , my love , that you are not seventeen . It is
f I do not now . When you tell me to love him as a brother , I shall no more s
hat Colonel Brandon was very much in love with Marianne Dashwood . She rather
e were ever animated enough to be in love , must have long outlived every sens
hirty - five anything near enough to love , to make him a desirable companion
roach would have been spared ." " My love ," said her mother , " you must not
pect that the misery of disappointed love had already been known to him . This
most melancholy order of disastrous love . CHAPTER 12 As Elinor and Marianne
hen she considered what Marianne ' s love for him was , a quarrel seemed almos
ctory way ;-- but you , Elinor , who love to doubt where you can -- it will no
man whom we have all such reason to love , and no reason in the world to thin
ded as he must be of your sister ' s love , should leave her , and leave her p
cannot think that . He must and does love her I am sure ." " But with a strang
I believe not ," cried Elinor . " I love Willoughby , sincerely love him ; an
or . " I love Willoughby , sincerely love him ; and suspicion of his integrity
deed a man could not very well be in love with either of her daughters , witho
It not only gave us the search results for the word "love", it also showed some of the text surrounding each match.
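The matches above come from NLTK's concordance() function, which also accepts optional width and lines parameters to control how much context is shown. Here is a minimal sketch using a small hypothetical token list (so it does not depend on the downloaded books):

```python
from nltk.text import Text

# Hypothetical toy token list standing in for a real book's tokens
tokens = "we love words and words love us and all of us love reading".split()
toy_text = Text(tokens)

# width limits the characters shown per line, lines caps the matches displayed
toy_text.concordance("love", width=40, lines=2)
```

Run against text2, the same parameters let you shorten the long listing above to just a handful of matches.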
Finding similar words in book using NLTK
We can also find similar words in the book, that is, other words that appear in contexts similar to "love" (or any other word of your choice) throughout the whole book. For this we will use the similar() function on text2 and pass the word "love", just like earlier:
text2.similar("love")
affection sister heart mother time see town life it dear elinor marianne me word family her him do regard head
From the above result we can see that we get words and parts of sentences which may represent the word "love" in one way or another, for example "affection", "dear", "family" and "regard". Similarly we can try it for other words too, like "danger" or "laugh".
house time letter living world furniture visit change family alteration person it children heart half first whole hope day moment
be look manner house person mother day moment concern subject ruin smile week letter tears and by of in at
The similar() function gives us other words from the book that have been used in contexts similar to the searched word.
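To see what similar() is doing under the hood, here is a tiny self-contained sketch with hypothetical tokens: "tea" and "coffee" are placed in identical contexts ("drink … every" and "likes … a"), so asking for words similar to "tea" proposes "coffee":

```python
from nltk.text import Text

# Hypothetical tokens where "tea" and "coffee" share the same contexts
toy = Text(("i drink tea every day and i drink coffee every night "
            "she likes tea a lot and she likes coffee a bit").split())

# Prints the words that occur in the same contexts as "tea"
toy.similar("tea")
```

This is why, in the book, words like "affection" and "regard" show up for "love": they tend to appear between the same neighbouring words.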
Common contexts shared by 2 or more words
Using common_contexts([list_of_words]) we can check which contexts are shared by two or more words, that is, in which surroundings all of the given words appear.
was_young was_well was_much know_well is_well it_much him_well are_good be_much been_much had_much the_great you_far
From the above example we can see that the common_contexts() function gives us all those contexts in which both of the given words have been used.
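A small self-contained sketch of common_contexts(), again with hypothetical tokens so it does not need the downloaded books: "cat" and "dog" both occur between "the … sat" and "a … ran", so those two contexts are reported:

```python
from nltk.text import Text

# Hypothetical tokens where "cat" and "dog" share two contexts
toy = Text(("the cat sat here and the dog sat there while "
            "a cat ran and a dog ran").split())

# Prints the contexts shared by both words, e.g. the_sat and a_ran
toy.common_contexts(["cat", "dog"])
```

Each printed item left_right describes a slot: both words appear with "left" immediately before and "right" immediately after.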
Set of punctuation & words used
We can also get the set of all words and punctuation that are used in a book ("text2" in this case). To get that you can use:
import pandas as pd
temp_df = pd.DataFrame(set(text2))
temp_df.head()
temp_df = pd.DataFrame(sorted(set(text2)))
temp_df.head()
We can see that the set() function gives us the complete set of distinct tokens. You can also try running just set(text2); it will print the complete set.
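set() tells us which tokens occur, but not how often. NLTK's FreqDist gives the counts as well; here is a minimal sketch with a hypothetical sentence (you can pass text2 in the same way):

```python
from nltk import FreqDist

# Count how often each token occurs in a (hypothetical) token list
tokens = "to be or not to be that is the question".split()
fd = FreqDist(tokens)

print(fd["to"])           # count for one specific token
print(fd.most_common(2))  # the most frequent tokens with their counts
```

FreqDist behaves like a dictionary from token to count, which makes it a natural next step after looking at the raw vocabulary.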
To get the number of different words, i.e. the vocabulary size including punctuation, you can use:
number_of_distinct_elements = len(set(text2))
number_of_distinct_elements
total_words_in_book = len(text2)
total_words_in_book
Now, to check the % of distinct words we can simply divide them.
print((number_of_distinct_elements /total_words_in_book) * 100 )
It shows us a very interesting result: the complete book comprises only 4.83% distinct words and punctuation. The whole book is just an amalgamation of such a small vocabulary.
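This percentage is often called lexical diversity, and it is handy to wrap the computation in a small helper function (a sketch; the function name is our own, not part of NLTK):

```python
def lexical_diversity(tokens):
    """Percentage of distinct tokens in a sequence of tokens."""
    return len(set(tokens)) / len(tokens) * 100

# Toy example: 4 distinct tokens out of 5 -> 80.0
print(lexical_diversity("the cat saw the dog".split()))

# For the book you would call: lexical_diversity(text2)
```

The same helper works on any token sequence, so you can compare the richness of different books with one line each.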
Checking dispersion of specific words
We can also check the dispersion of specific words throughout the book on a plot, using dispersion_plot():
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 8))  # to increase default size of figure
text2.dispersion_plot(["love", "very", "too", "hate"])
Every vertical line in the above graph shows us the location of each occurrence of the word in the actual book.
Let’s apply above techniques on our own text.
sample_words = "It is good to be someone's charioteer but in order to become " \
               "someone's charioteer you have to be someone first, who never " \
               "misses his target. Both are important roles that one has to play " \
               "in this life. Hi Everyone! In this article we will learn about " \
               "Natural Language Toolkit - 'NLTK'. NLTK is one of the key libraries " \
               "which is widely used for Natural Language Processing in Python. " \
               "NLTK can be used in a variety of ways improving your way of looking at text."
sample_words
"It is good to be someone's charioteer but in order to become someone's charioteer you have to be someone first, who never misses his target. Both are important roles that one has to play in this life. Hi Everyone! In this article we will learn about Natural Language Toolkit - 'NLTK'. NLTK is one of the key libraries which is widely used for Natural Language Processing in Python. NLTK can be used in a variety of ways improving your way of looking at text."
tokens = nltk.word_tokenize(sample_words)
tokens[:10]
['It', 'is', 'good', 'to', 'be', 'someone', "'s", 'charioteer', 'but', 'in']
print("Number of words and punctuations in above text = " + str(len(tokens)))
Number of words and punctuations in above text = 94
In the above steps, we have converted our sentence into 94 tokens, that is, we have divided the complete text into parts.
Why did we do this? NLTK's library functions require raw text to be converted into NLTK's Text class. For doing so we will use the following:
nltk_text = nltk.Text(tokens)
print(nltk_text)
print("Type of nltk_text = " + str(type(nltk_text)))
<Text: It is good to be someone 's charioteer...>
Type of nltk_text = <class 'nltk.text.Text'>
Now, since we have obtained NLTK “TEXT” class type text, we can proceed with using all of its library functions on it.
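Besides concordance() and the functions used earlier, the Text class also exposes simple helpers such as count() and index(). A small sketch with hypothetical tokens:

```python
from nltk.text import Text

# Hypothetical tokens wrapped in an nltk.text.Text object
toy = Text("good words make good sentences and good paragraphs".split())

print(toy.count("good"))   # number of occurrences of "good"
print(toy.index("words"))  # position of the first occurrence of "words"
```

These are thin wrappers over the underlying token list, but they read more naturally when you are already working with Text objects.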
# Finding a word
nltk_text.concordance('NLTK')
Displaying 2 of 2 matches:
Natural Language Toolkit - 'NLTK ' . NLTK is one of the key libraries which is
tural Language Processing in Python . NLTK can be used in a variety of ways imp
# Number of distinct words
number_of_distinct_words = len(set(nltk_text))
number_of_distinct_words
# Total number of words
total_words = len(nltk_text)
total_words
# % Richness of words
richness_of_words = (number_of_distinct_words / total_words) * 100
print("% Richness of words = " + str(richness_of_words))
% Richness of words = 74.46808510638297
# Dispersion plot for our text
plt.figure(figsize=(12, 4))  # to increase default size of figure
nltk_text.dispersion_plot(["NLTK", "in", "someone"])
In our next tutorial we will dive further into the use of NLTK and Natural Language Processing. So, stay tuned and keep learning.
You can also check out our interesting video tutorials on YouTube: ML For Analytics.