Creating a Word Data Set from RSS Feeds

Hi everyone! Whenever we try to learn a Machine Learning algorithm, the first question that comes to mind is “How can we get live or real data for testing our algorithm?” This article focuses on creating exactly such a data set by extracting data from the RSS feeds of multiple websites. Just by adding the URLs of different websites that publish an RSS feed, we will be able to extract word counts from them.

The advantage of doing so is that we get authentic data, which we can use with many Machine Learning algorithms, for example unsupervised learning techniques such as clustering.

To enable this functionality, I will be using Python's ‘feedparser’ library. It is an open-source library that helps us extract data from RSS feeds, and you can easily download it for free (for example with pip install feedparser)!

The code is as follows:

#Import feedparser library and re (regular expression) library
import feedparser
import re

#Return the title of an RSS feed and a dictionary of its word counts

def pullWordCount(url):
    #Using feedparser to parse the feed
    fed = feedparser.parse(url)
    wordCount = {}

    for x in fed.entries:
        #Use the summary when the entry has one, else fall back to the description
        if 'summary' in x:
            summary = x.summary
        else:
            summary = x.get('description', '')

        #Extracting a list of words from feeds
        diffWords = pullWords(x.title+ ' ' + summary)
        for words in diffWords:
            wordCount.setdefault(words, 0)
            wordCount[words]+=1

    return fed.feed.title,wordCount

#Removing unnecessary data and refining our data
def pullWords(htmlTag):
    #Remove all HTML tags
    txtData = re.compile(r'<[^>]+>').sub('', htmlTag)

    #Split into words on runs of non-alphabetic characters
    words = re.compile(r'[^A-Za-z]+').split(txtData)

    #Convert all the words to lower case for uniformity
    return [word.lower() for word in words if word!='']

#appearCount maps each word to the number of feeds in which it appears more than once
appearCount = {}
#wordsCount maps each feed title to that feed's word counts
wordsCount = {}
#testlist.txt contains the feed URLs, one per line
with open('testlist.txt') as feedFile:
    feedList = [line.strip() for line in feedFile if line.strip()]
for url in feedList:
    title,wordC = pullWordCount(url)
    wordsCount[title] = wordC
    for word,count in wordC.items():
        appearCount.setdefault(word,0)
        if count>1:
            appearCount[word]+=1

wordList=[]
for wor,bc in appearCount.items():
    #Fraction of the parsed feeds in which the word appears more than once
    percent=float(bc)/len(wordsCount)
    if percent>0.02 and percent<0.8:
        wordList.append(wor)
#The bounds above keep only the words whose appearance percentage is
#between 2% and 80%; you can modify them for different kinds of results

#Our data will be saved in BlogWordsData.txt as a tab-separated table
out=open('BlogWordsData.txt','w')
out.write('TestingBlog')
for word in wordList: out.write('\t%s' %word)
out.write('\n')
for blog,wc in wordsCount.items():
    out.write(blog)
    for word in wordList:
        if word in wc: out.write('\t%d' %wc[word])
        else: out.write('\t0')
    out.write('\n')
out.close()
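
If you want to try the word extraction and the percentage filter without downloading any feeds, the steps can be sketched on a few in-memory strings. This is only a minimal sketch under some assumptions: the three "feeds" are made-up HTML snippets, and count_words and select_words are illustrative names that do not appear in the script above. Unlike the script, it replaces each HTML tag with a space rather than an empty string, so that two words separated only by a tag are not glued together.

```python
import re
from collections import Counter

TAG_RE = re.compile(r'<[^>]+>')           # an HTML tag such as <p> or </b>
NON_ALPHA_RE = re.compile(r'[^A-Za-z]+')  # a run of non-letter characters

def count_words(html_text):
    """Strip HTML tags, split on non-letters, and count lower-cased words."""
    text = TAG_RE.sub(' ', html_text)
    return Counter(w.lower() for w in NON_ALPHA_RE.split(text) if w)

def select_words(counts_per_feed, low=0.02, high=0.8):
    """Keep words whose share of feeds lies strictly between low and high.

    As in the script, a feed only counts for a word if the word
    appears in that feed more than once.
    """
    feeds_with_word = Counter()
    for counts in counts_per_feed.values():
        feeds_with_word.update(w for w, c in counts.items() if c > 1)
    total = len(counts_per_feed)
    return sorted(w for w, n in feeds_with_word.items()
                  if low < n / total < high)

# Hypothetical feed contents, standing in for real RSS entries
samples = {
    'Feed A': '<p>The python guide: the <b>python</b> basics</p>',
    'Feed B': '<p>The clustering guide, the clustering basics</p>',
    'Feed C': '<p>The news, the weather, the python python</p>',
}

counts_per_feed = {title: count_words(html) for title, html in samples.items()}
selected = select_words(counts_per_feed)
print(selected)  # ['clustering', 'python']
```

Here "the" is repeated in all three feeds (100%), so it falls outside the upper bound, while "python" (2 of 3 feeds) and "clustering" (1 of 3 feeds) are kept.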

Test List (testlist.txt)

[Image: feedlist]

Output

[Image: blog]
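
To use the data later, the file has to be read back in. The snippet below is a rough sketch of how that could look, assuming the file is a tab-separated table whose first row holds the words and whose remaining rows each hold a feed title followed by its counts. The function name read_word_table and the sample rows are made up for illustration; with the real file you would pass open('BlogWordsData.txt') instead of the in-memory sample.

```python
import io

def read_word_table(lines):
    """Parse the tab-separated table into (feed titles, words, count rows)."""
    rows = [line.rstrip('\n').split('\t') for line in lines if line.strip()]
    words = rows[0][1:]                   # header row, minus the corner cell
    titles = [row[0] for row in rows[1:]]
    data = [[int(v) for v in row[1:]] for row in rows[1:]]
    return titles, words, data

# Hypothetical file contents in the same layout as BlogWordsData.txt
sample = io.StringIO(
    'TestingBlog\tpython\tclustering\n'
    'Feed A\t2\t0\n'
    'Feed B\t0\t2\n'
)

titles, words, data = read_word_table(sample)
for title, row in zip(titles, data):
    print(title, dict(zip(words, row)))
```

Each row of data is then a vector of word counts for one feed, which is a convenient shape for clustering.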

In our next tutorial, we will use data extracted with this technique to learn some new techniques in Unsupervised Learning.

Stay tuned and keep learning!!

 

 
