
Introduction to NLTK


Hi Everyone! In this article we will learn about the Natural Language Toolkit ("NLTK"). NLTK is one of the key libraries widely used for Natural Language Processing in Python, and it can be used in a variety of ways to improve how you look at text.

Let's begin by importing NLTK. Then we will download the required NLTK packages, which include preprocessed data as well as the text of many renowned books.

In [0]:
import nltk
In [2]:
# Downloading required packages
nltk.download('all')
Out[2]:
True
In [3]:
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

When we import book from NLTK, it always prints the examples shown above. If we want to access any one of those texts, we can simply type its key in Python and it will return its data.

In this case the keys are "text1", "text2" and so on.

Let's try calling text2.

In [4]:
text2
Out[4]:
<Text: Sense and Sensibility by Jane Austen 1811>

It simply returned the text corresponding to the key text2.

Searching data in book/text provided by NLTK

To be specific, text2 contains the complete book Sense and Sensibility by Jane Austen (1811).

We can traverse that book and search for words, along with the sentences containing them.

So we will use the concordance() function on text2 and search for a word in it, say love. We can do it as shown below.

In [5]:
text2.concordance("love")
Displaying 25 of 77 matches:
priety of going , and her own tender love for all her three children determine
es ." " I believe you are right , my love ; it will be better that there shoul
 . It implies everything amiable . I love him already ." " I think you will li
sentiment of approbation inferior to love ." " You may esteem him ." " I have 
n what it was to separate esteem and love ." Mrs . Dashwood now took pains to 
oner did she perceive any symptom of love in his behaviour to Elinor , than sh
 how shall we do without her ?" " My love , it will be scarcely a separation .
ise . Edward is very amiable , and I love him tenderly . But yet -- he is not 
ll never see a man whom I can really love . I require so much ! He must have a
ry possible charm ." " Remember , my love , that you are not seventeen . It is
f I do not now . When you tell me to love him as a brother , I shall no more s
hat Colonel Brandon was very much in love with Marianne Dashwood . She rather 
e were ever animated enough to be in love , must have long outlived every sens
hirty - five anything near enough to love , to make him a desirable companion 
roach would have been spared ." " My love ," said her mother , " you must not 
pect that the misery of disappointed love had already been known to him . This
 most melancholy order of disastrous love . CHAPTER 12 As Elinor and Marianne 
hen she considered what Marianne ' s love for him was , a quarrel seemed almos
ctory way ;-- but you , Elinor , who love to doubt where you can -- it will no
 man whom we have all such reason to love , and no reason in the world to thin
ded as he must be of your sister ' s love , should leave her , and leave her p
cannot think that . He must and does love her I am sure ." " But with a strang
 I believe not ," cried Elinor . " I love Willoughby , sincerely love him ; an
or . " I love Willoughby , sincerely love him ; and suspicion of his integrity
deed a man could not very well be in love with either of her daughters , witho

It not only returned the search results for the word love, it also showed some of the text surrounding each match.
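
Under the hood, a concordance view simply scans the token stream for the target word and collects a fixed window of surrounding tokens for each match. Here is a rough pure-Python sketch of the idea (not NLTK's actual implementation, and `simple_concordance` is a name made up for illustration):

```python
def simple_concordance(tokens, word, window=4):
    """Return a context window of `window` tokens on each side of every match."""
    matches = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            matches.append(" ".join(left + [tok] + right))
    return matches

tokens = "I love him already . I think you will love him too".split()
for line in simple_concordance(tokens, "love"):
    print(line)
```

NLTK's real concordance() also pads each line to a fixed width so the matches line up in a column, which is why the output above looks aligned.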

Finding similar words in book using NLTK

We can also find similar words in the book, that is, other words that may be used like love (or any other word of your choice) across the whole book.

To do so, we will use the similar() function on text2 and pass the word "love", just like earlier, to fetch similar words.

In [6]:
text2.similar('love')
affection sister heart mother time see town life it dear elinor
marianne me word family her him do regard head

From the above result we can see words that may represent love in one way or another, for example "affection", "dear", "family" and "regard".

Similarly, we can try it for other words too, like "danger" and "laugh".

In [7]:
text2.similar('danger')
house time letter living world furniture visit change family
alteration person it children heart half first whole hope day moment
In [8]:
text2.similar('laugh')
be look manner house person mother day moment concern subject ruin
smile week letter tears and by of in at

So, basically, the similar() function gives us other words from the book that have been used in contexts similar to the searched word.

Common contexts of 2 or more words

By using common_contexts([list_of_words]) we can check which contexts are shared by 2 or more words.

In [9]:
text2.common_contexts(['joy', 'love'])
her_and
In [10]:
text2.common_contexts(['very', 'too'])
was_young was_well was_much know_well is_well it_much him_well
are_good be_much been_much had_much the_great you_far

From the above examples we can see that common_contexts() returns the contexts, shown as preceding_following word pairs, in which all of the given words appear.
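
Conceptually, this amounts to collecting the (previous word, next word) pairs around each occurrence of a word and intersecting those sets. A simplified sketch (again, not NLTK's actual code; both function names here are hypothetical):

```python
def contexts(tokens, word):
    """Set of (previous, next) token pairs surrounding each occurrence of `word`."""
    return {(tokens[i - 1].lower(), tokens[i + 1].lower())
            for i in range(1, len(tokens) - 1)
            if tokens[i].lower() == word.lower()}

def shared_contexts(tokens, word_a, word_b):
    """Contexts in which both words appear."""
    return contexts(tokens, word_a) & contexts(tokens, word_b)

tokens = "her joy and her love and his joy and his love now".split()
print(shared_contexts(tokens, "joy", "love"))   # both words occur between 'her' and 'and'
```

The ('her', 'and') pair this sketch finds is exactly the kind of result NLTK prints as her_and above.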

Set of punctuation & words used

We can also get the set of all words and punctuation marks used in a book (text2 in this case). To get that, you can call the set() function on the text.

In [11]:
import pandas as pd
temp_df = pd.DataFrame(set(text2))
temp_df.head()
Out[11]:
0
0 troop
1 Bishop
2 ventured
3 eat
4 string
In [12]:
temp_df = pd.DataFrame(sorted(set(text2)))
temp_df.head()
Out[12]:
0
0 !
1 !”
2 !”–
3 !’
4 !'”

We can see that the set() function gives us the complete set of tokens. You can try running just set(text2); it will print the complete list.

To get the number of different words in the vocabulary, including punctuation, you can use the len() function.

In [13]:
number_of_distinct_elements = len(set(text2))
number_of_distinct_elements
Out[13]:
6833
In [14]:
total_words_in_book = len(text2)
total_words_in_book
Out[14]:
141576

Now, to check the percentage of distinct words, we can simply divide the two.

In [15]:
print((number_of_distinct_elements /total_words_in_book) * 100 )
4.826383002768831

It shows a very interesting result: the complete book comprises only 4.83% distinct words and punctuation. The whole book is just an amalgamation of such few words.
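
This ratio is often called lexical diversity, and since we will compute it again later for our own text, it is handy to wrap it in a small helper (a convenience function of our own, not part of NLTK):

```python
def lexical_diversity(tokens):
    """Percentage of distinct tokens (words and punctuation) in a token sequence."""
    return len(set(tokens)) / len(tokens) * 100

# Works on any sequence of tokens, including NLTK Text objects.
print(round(lexical_diversity(["to", "be", "or", "not", "to", "be"]), 2))
```

Calling lexical_diversity(text2) would reproduce the 4.83% figure computed above.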

Checking dispersion of specific words

We can also check the dispersion of specific words throughout the book on a plot, using the dispersion_plot([list_of_words]) function.

In [16]:
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 8))     # to increase default size of figure

text2.dispersion_plot(["love", "very", "too", "hate"])
[Dispersion plot of "love", "very", "too" and "hate" across text2]

Every vertical line in the above plot shows the location of one occurrence of the word in the actual book.
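
The positions being plotted are just the token offsets at which each word occurs, and you can compute them yourself. A minimal sketch of what the plot draws (`word_offsets` is a name made up for illustration):

```python
def word_offsets(tokens, word):
    """Indices at which `word` occurs in the token stream."""
    return [i for i, tok in enumerate(tokens) if tok.lower() == word.lower()]

tokens = "love is love and hate is not love".split()
print(word_offsets(tokens, "love"))   # every occurrence position of 'love'
print(word_offsets(tokens, "hate"))
```

dispersion_plot() essentially draws one vertical tick per offset, with one row per word.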

Let's apply the above techniques to our own text.

In [17]:
sample_words = "It is good to be someone's charioteer but in order to become " \
               "someone's charioteer you have to be someone first, who never " \
               "misses his target. Both are important roles that one has to play " \
               "in this life. Hi Everyone! In this article we will learn about " \
               "Natural Language Toolkit - 'NLTK'. NLTK is one of the key libraries " \
               "which is widely used for Natural Language Processing in Python. " \
               "NLTK can be used in a variety of ways improving your way of looking at text."

sample_words
Out[17]:
"It is good to be someone's charioteer but in order to become someone's charioteer you have to be someone first, who never misses his target. Both are important roles that one has to play in this life. Hi Everyone! In this article we will learn about Natural Language Toolkit - 'NLTK'. NLTK is one of the key libraries which is widely used for Natural Language Processing in Python. NLTK can be used in a variety of ways improving your way of looking at text."
In [18]:
tokens = nltk.word_tokenize(sample_words)
tokens[:10]
Out[18]:
['It', 'is', 'good', 'to', 'be', 'someone', "'s", 'charioteer', 'but', 'in']
In [19]:
print("Number of words and punctuations in above text = " + str(len(tokens)))
Number of words and punctuations in above text = 94

In the above steps, we have converted our text into 94 tokens; that is, we have divided the complete text into parts.
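
Note that word_tokenize() does more than split on whitespace: it also separates punctuation and clitics such as 's into their own tokens. A rough regex-based approximation of that behaviour (NLTK's real tokenizer is considerably more sophisticated, e.g. it keeps 's together as one token):

```python
import re

def rough_tokenize(text):
    """Split text into word-like chunks and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(rough_tokenize("It is good to be someone's charioteer."))
```

This is why the token count (94) is higher than a simple word count would be: periods, commas and apostrophes each count as tokens of their own.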

Why did we do this? NLTK requires raw text to be converted into NLTK's Text class type. For doing so, we will use the function Text(tokens).

In [20]:
nltk_text = nltk.Text(tokens)
print(nltk_text)

print("Type of nltk_text = " + str(type(nltk_text)))
<Text: It is good to be someone 's charioteer...>
Type of nltk_text = <class 'nltk.text.Text'>

Now, since we have obtained NLTK “TEXT” class type text, we can proceed with using all of its library functions on it.

In [21]:
# Finding a word
nltk_text.concordance('NLTK')
Displaying 2 of 2 matches:
Natural Language Toolkit - 'NLTK ' . NLTK is one of the key libraries which is
tural Language Processing in Python . NLTK can be used in a variety of ways imp
In [22]:
# Number of distinct words
number_of_distinct_words = len(set(nltk_text))
number_of_distinct_words
Out[22]:
70
In [23]:
# Total number of words
total_words=len(nltk_text)
total_words
Out[23]:
94
In [24]:
# % Richness of words
richness_of_words = ((number_of_distinct_words/ total_words) * 100)

print("% Richness of words = " + str(richness_of_words))
% Richness of words = 74.46808510638297
In [25]:
# Dispersion plot for our text

plt.figure(figsize=(12, 4))     # to increase default size of figure
nltk_text.dispersion_plot(["NLTK", "in", "someone"])
[Dispersion plot of "NLTK", "in" and "someone" across our sample text]

In our next tutorial we will dive deeper into NLTK and Natural Language Processing. So stay tuned and keep learning.

You can also check out our video tutorials on the ML For Analytics YouTube channel.
