Introduction to NLTK – Natural Language Toolkit
Hi Everyone! In this article we will learn about the Natural Language Toolkit – “NLTK”. NLTK is one of the key libraries widely used for Natural Language Processing in Python, and it can be used in a variety of ways to improve how you look at text.
Let’s begin by importing NLTK. Then, we will download the required NLTK packages, which include preprocessed data as well as the text of many renowned books.
import nltk
# Download all NLTK data packages (corpora, models, and the example books)
nltk.download('all')
from nltk.book import *
When we import book from NLTK, it lists the example texts it has loaded, as shown above. If we want to access any one of them, we can simply type its key in Python and it will return its data.
In this case the keys are “text1”, “text2” and so on.
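If the list scrolls out of view, nltk.book also provides a small texts() helper that re-prints the loaded texts at any time. A quick, optional example, assuming the imports above have already been run:
# Re-print the list of loaded example texts (text1 ... text9)
texts()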
Let’s try calling text2.
text2
It simply returned the text corresponding to the key text2.
Searching data in book/text provided by NLTK
To be specific, text2 itself contains a complete book, namely Sense and Sensibility by Jane Austen (1811), and we can traverse that book to search for words and the sentences related to them.
So, we will use the concordance() function on text2 and search for a word in it, let’s say “love”. We can do it as shown below.
text2.concordance("love")
It not only gave us the search results for the word “love”, it also showed some of the text surrounding each match.
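If the output is too long, concordance() also accepts optional width and lines arguments that control how much is printed. A small sketch of the same search with a tighter display:
# width = characters per printed line, lines = maximum number of matches to show
text2.concordance("love", width=60, lines=5)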
Finding similar words in book using NLTK
We can also find similar words in the book, for example which other words are used like “love” throughout the whole book, or any other word of your choice.
To do so, we will use the similar() function on text2 and pass the word “love”, just like earlier, to fetch similar words.
text2.similar('love')
From the above result we can see that we get words which may represent “love” in one way or another, for example “affection”, “dear”, “family”, “regard” etc.
Similarly, we can try it for other words too, like “danger” and “laugh”.
text2.similar('danger')
text2.similar('laugh')
So, basically, the similar() function gives us other words from the book that have been used in contexts similar to the searched word.
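As a small optional tweak, similar() also takes a num argument that limits how many similar words are returned, for example:
# Return at most 10 distributionally similar words
text2.similar('love', num=10)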
Common contexts for 2 or more words
By using common_contexts([list_of_words]), we can check the contexts that two or more words share, i.e. the surrounding words with which each of them appears.
text2.common_contexts(['joy', 'love'])
text2.common_contexts(['very', 'too'])
From the above examples we can see that common_contexts() returns the contexts (pairs of surrounding words) in which both of the given words have been used.
Set of punctuation & words used
We can also get the set of all words & punctuation used in a book (‘text’ in this case). To get that, you can use the built-in set(text) function.
import pandas as pd
temp_df = pd.DataFrame(set(text2))
temp_df.head()
temp_df = pd.DataFrame(sorted(set(text2)))
temp_df.head()
We can see that the set() function gives us the complete set of distinct tokens. You can try running just set(text2); it will print the complete collection.
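If you want a cleaner vocabulary without punctuation, here is a minimal sketch (my own addition, not part of the book data) that keeps only lower-cased alphabetic tokens:
# Keep only alphabetic tokens, lower-cased, so punctuation and case variants are dropped
clean_vocab = sorted(set(w.lower() for w in text2 if w.isalpha()))
clean_vocab[:10]
len(clean_vocab)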
To get the number of distinct words (the vocabulary, including punctuation), you can use the len() function.
number_of_distinct_elements = len(set(text2))
number_of_distinct_elements
total_words_in_book = len(text2)
total_words_in_book
Now, to check the percentage of distinct words, we can simply divide one by the other.
print((number_of_distinct_elements /total_words_in_book) * 100 )
It shows us a very interesting result: the complete book comprises only 4.83% distinct words & punctuation. The whole book is just an amalgamation of that small set of words.
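Since we will repeat this calculation later for our own text, here is a small helper (my own sketch, not an NLTK function) that wraps it:
# Percentage of distinct tokens in a text or token list
def lexical_diversity(text):
    return (len(set(text)) / len(text)) * 100

print(lexical_diversity(text2))  # same ~4.83% figure as above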
Checking dispersion of specific words
We can also check the dispersion of specific words throughout the book on a plot, using the dispersion_plot([list_of_words]) function.
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 8)) # to increase default size of figure
text2.dispersion_plot(["love", "very", "too", "hate"])
Every vertical line in the above graph shows us the location of each word in the actual book.
Let’s apply the above techniques to our own text.
smaple_words = "It is good to be someone's charioteer but in order to become " \
"someone's charioteer you have to be someone first, who never " \
"misses his target. Both are important roles that one has to play " \
"in this life. Hi Everyone! In this article we will learn about " \
"Natural Language Toolkit - 'NLTK'. NLTK is one of the key libraries " \
"which is widely used for Natural Language Processing in Python. " \
"NLTK can be used in a variety of ways improving your way of looking at text."
sample_words
tokens = nltk.word_tokenize(sample_words)
tokens[:10]
print("Number of words and punctuations in above text = " + str(len(tokens)))
In the above steps, we converted our text into 94 tokens, that is, we divided the complete text into individual words and punctuation marks.
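As an aside, besides word_tokenize(), NLTK also provides sent_tokenize(), which splits raw text into sentences instead of words. A quick sketch using the same sample_words string:
# Split the raw text into sentences (uses the punkt data downloaded earlier)
sentences = nltk.sent_tokenize(sample_words)
print(len(sentences))
sentences[0]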
Why did we tokenize the text in the first place? NLTK’s text-exploration functions require raw text to be converted into NLTK’s Text class. To do so, we will use the Text(tokens) constructor.
nltk_text = nltk.Text(tokens)
print(nltk_text)
print("Type of nltk_text = " + str(type(nltk_text)))
Now that we have obtained an NLTK Text object, we can use all of its library functions on it.
# Finding a word
nltk_text.concordance('NLTK')
# Number of distinct words
number_of_distinct_words = len(set(nltk_text))
number_of_distinct_words
# Total number of words
total_words=len(nltk_text)
total_words
# % Richness of words
richness_of_words = ((number_of_distinct_words/ total_words) * 100)
print("% Richness of words = " + str(richness_of_words))
# Dispersion plot for our text
plt.figure(figsize=(12, 4)) # to increase default size of figure
nltk_text.dispersion_plot(["NLTK", "in", "someone"])
In our next tutorial we will dive deeper into NLTK & Natural Language Processing. So, stay tuned and keep learning.
You can also check out our video tutorials on the ML For Analytics YouTube channel.