
Create Words’ Data-Set from RSS Feeds


Hi Everyone!! Whenever we try to learn a Machine Learning algorithm, the first thing that comes to our mind is, “How can we get live or real data for testing our algorithm?” This article helps you create a words’ data-set from RSS feeds by extracting data from the RSS feeds of multiple websites. Just by adding the URLs of different websites that publish RSS feeds, we will be able to extract multiple words from them.

Create words’ data-set from RSS feeds: Advantage

The advantage of doing so is that we get authentic data, which we can use for multiple Machine Learning techniques such as clustering, other unsupervised learning methods, and many more.

Create words’ data-set from RSS feeds: Library

For enabling this functionality, I will be using the ‘feedparser’ library of Python. It’s an open-source library which helps us extract data from RSS feeds, and you can easily download it for free!
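Before the full script, here is a minimal sketch of how feedparser exposes a parsed feed (the URL below is only a placeholder; swap in any real RSS feed URL):

[sourcecode language="python" wraplines="false" collapse="false"]
import feedparser

#Parse an RSS/Atom feed by URL (placeholder URL, replace with a real feed)
fed = feedparser.parse('https://example.com/feed')

#The parsed result exposes the feed title and a list of entries
print(fed.feed.get('title', 'No title found'))
for entry in fed.entries:
    #Each entry carries a title and (usually) a summary of the post
    print(entry.title)
[/sourcecode]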

The code to create a words’ data-set from RSS feeds is as follows:

[sourcecode language="python" wraplines="false" collapse="false"]
#Import feedparser library and re (regular expression) library
import feedparser
import re

#Creating a dictionary of titles and word counts corresponding to an RSS feed
def pullWordCount(url):
    #Using feedparser to parse the feed
    fed = feedparser.parse(url)
    wordCount = {}

    for x in fed.entries:
        if 'summary' in x:
            summary = x.summary
        else:
            summary = x.description

        #Extracting a list of words from the entry's title and summary
        diffWords = pullWords(x.title + ' ' + summary)
        for word in diffWords:
            wordCount.setdefault(word, 0)
            wordCount[word] += 1

    return fed.feed.title, wordCount

#Removing unnecessary data and refining our data
def pullWords(htmlTag):
    #removing all HTML tags
    txtData = re.compile(r'<[^>]+>').sub('', htmlTag)

    #splitting words on all non-alphabetic characters
    words = re.compile(r'[^A-Za-z]+').split(txtData)

    #converting all the words to lower case, to create uniformity
    return [word.lower() for word in words if word != '']

#appearCount has the number of feeds in which a word has appeared
appearCount = {}
#wordsCount has the word counts of every feed, keyed by feed title
wordsCount = {}
#testlist.txt contains URLs of websites, one RSS feed URL per line
feedList = [line.strip() for line in open('testlist.txt') if line.strip()]
for url in feedList:
    title, wordC = pullWordCount(url)
    wordsCount[title] = wordC
    for word, count in wordC.items():
        appearCount.setdefault(word, 0)
        if count > 1:
            appearCount[word] += 1

wordList = []
for word, bc in appearCount.items():
    percent = float(bc) / len(feedList)
    if percent > 0.02 and percent < 0.8:
        wordList.append(word)
#by the above percentages we mean that we are keeping words whose appearance
#percentage lies between 2% and 80%; you can modify it for different kinds of results

#our data will be saved in BlogWordsData.txt
out = open('BlogWordsData.txt', 'w')
out.write('TestingBlog')
for word in wordList:
    out.write('\t%s' % word)
out.write('\n')
for blog, wc in wordsCount.items():
    out.write(blog)
    for word in wordList:
        if word in wc:
            out.write('\t%d' % wc[word])
        else:
            out.write('\t0')
    out.write('\n')
out.close()
[/sourcecode]

Test List (testlist.txt)

[Screenshot of testlist.txt, which lists one RSS feed URL per line]
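For illustration only, with purely hypothetical feed URLs, testlist.txt would look like this:

[sourcecode language="text" wraplines="false" collapse="false"]
https://example-blog-one.com/feed
https://example-blog-two.com/rss
https://example-news-site.com/feed
[/sourcecode]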

Output

[Screenshot of BlogWordsData.txt, a tab-separated table of blogs versus word counts]
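The actual words and counts will depend entirely on the feeds you list, but with made-up values the tab-separated output has this shape:

[sourcecode language="text" wraplines="false" collapse="false"]
TestingBlog	data	learning	python	model
Example Blog One	3	0	5	1
Example Blog Two	0	2	1	4
[/sourcecode]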

In our next tutorial, we will use the data extracted with this same technique to learn some new techniques in Unsupervised Learning.

Stay tuned and keep learning!!
