Creating a Word Data Set from RSS Feeds

Hi everyone! Whenever we try to learn a Machine Learning algorithm, the first question that comes to mind is “How can we get live or real data for testing our algorithm?” This article focuses on creating such a data set by extracting data from the RSS feeds of multiple websites. Just by adding the URLs of different websites that publish an RSS feed, we will be able to extract the words they use and count them.

The advantage of doing so is that we get authentic data, which we can then use with Machine Learning algorithms such as clustering and other unsupervised learning techniques.

To enable this functionality, I will be using the ‘feedparser’ library for Python. It’s an open-source library that helps us extract data from RSS feeds, and you can easily download it for free!

The complete code for building the data set is as follows:

[sourcecode language="python" wraplines="false" collapse="false"]
# Import the feedparser library and the re (regular expression) library
import feedparser
import re

# Build a dictionary of word counts for one RSS feed and return it with the feed title

def pullWordCount(url):
    # Use feedparser to parse the feed
    fed = feedparser.parse(url)
    wordCount = {}

    for x in fed.entries:
        if 'summary' in x:
            summary = x.summary
        else:
            # Fall back to the description, or to an empty string if the entry has neither
            summary = x.get('description', '')

        # Extract a list of words from the entry's title and summary
        diffWords = pullWords(x.title + ' ' + summary)
        for words in diffWords:
            wordCount.setdefault(words, 0)
            wordCount[words] += 1

    return fed.feed.title, wordCount

# Strip markup and split the text into clean, lowercase words
def pullWords(htmlTag):
    # Remove all HTML tags
    txtData = re.compile(r'<[^>]+>').sub('', htmlTag)

    # Split on runs of non-alphabetic characters
    words = re.compile(r'[^A-Za-z]+').split(txtData)

    # Convert all the words to lower case for uniformity
    return [word.lower() for word in words if word != '']

# appearCount holds, for each word, the number of feeds in which it appears more than once
appearCount = {}
# wordsCount maps each feed title to that feed's word counts
wordsCount = {}
# testlist.txt contains the feed URLs, one per line
for url in open('testlist.txt'):
    url = url.strip()
    if not url:
        continue  # skip blank lines
    title, wordC = pullWordCount(url)
    wordsCount[title] = wordC
    for word, count in wordC.items():
        appearCount.setdefault(word, 0)
        if count > 1:
            appearCount[word] += 1

wordList = []
for wor, bc in appearCount.items():
    # Fraction of the parsed feeds in which the word is counted
    percent = float(bc) / len(wordsCount)
    if percent > 0.02 and percent < 0.8:
        wordList.append(wor)
# The thresholds above keep only the words whose appearance percentage is
# between 2% and 80%; you can modify them for different kinds of results

# Our data will be saved in BlogWordsData.txt as a tab-separated table
out = open('BlogWordsData.txt', 'w')
out.write('TestingBlog')
for word in wordList:
    out.write('\t%s' % word)
out.write('\n')
for blog, wc in wordsCount.items():
    out.write(blog)
    for word in wordList:
        if word in wc:
            out.write('\t%d' % wc[word])
        else:
            out.write('\t0')
    out.write('\n')
out.close()
[/sourcecode]
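Once BlogWordsData.txt has been written, it can be loaded back for use in an algorithm such as clustering. The following is a minimal sketch and not part of the script itself: the function name read_word_matrix is hypothetical, and it assumes the file is tab-separated, with a header row of words followed by one row of counts per feed.

```python
def read_word_matrix(path):
    """Parse a tab-separated word-count file into (feed_names, words, counts)."""
    with open(path, encoding="utf-8") as handle:
        # Drop empty lines and trailing newlines
        lines = [line.rstrip("\n") for line in handle if line.strip()]

    # Header row: a label cell followed by one column per word
    words = lines[0].split("\t")[1:]

    feed_names = []
    counts = []
    for line in lines[1:]:
        cells = line.split("\t")
        feed_names.append(cells[0])                        # first cell is the feed title
        counts.append([int(cell) for cell in cells[1:]])   # remaining cells are counts
    return feed_names, words, counts
```

Calling read_word_matrix('BlogWordsData.txt') would then return the feed titles, the list of words, and one row of counts per feed, ready to be passed to whichever algorithm you want to try.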