Web Crawler – Part 1

Hi Everyone! Today we will learn about web crawlers. A web crawler is a program that visits web pages and extracts information from them. In this basic crawler we will extract all the links present on a web page. To implement it we will use a Python library called “BeautifulSoup”.

BeautifulSoup

It is a Python library that simplifies the extraction of data from HTML and XML files. It automatically converts incoming documents to Unicode and outgoing documents to UTF-8. BeautifulSoup provides several methods that make it easy to navigate and search the parsed markup of a web page. Note: If you create a BeautifulSoup object in Python 3 without naming a parser, you may see a warning. The warning does not stop the program, but to suppress it, pass a parser name such as “lxml” as the second argument to BeautifulSoup. The lxml package has to be installed separately.
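To make the description above concrete, here is a minimal sketch of parsing a small HTML string and reading the href of each anchor tag. The markup and the example.com addresses are made up for illustration, and the built-in "html.parser" is used here only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, so the example runs without network access.
html = """
<html><body>
  <a href="https://example.com/about">About</a>
  <a href="/contact">Contact</a>
  <a name="top">No href here</a>
</body></html>
"""

# "html.parser" ships with Python; "lxml" can be used instead if it is installed.
soup = BeautifulSoup(html, "html.parser")

# find_all('a') returns every anchor tag; get('href') returns None
# when the tag has no href attribute.
hrefs = [a.get("href") for a in soup.find_all("a")]
print(hrefs)
```

The third anchor has no href attribute, so its entry in the list is None. This is why a crawler should check for None before calling string methods on the result.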

The code for the basic web crawler is as follows:

from bs4 import BeautifulSoup
import requests

url = "https://mlforanalytics.com"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")

for link in soup.find_all('a'):
    aString = link.get('href')
    # Uncomment the next line to see every href value extracted from the page
    # print(aString)
    # Skip anchors without an href attribute, for which get() returns None
    if aString is not None:
        # Keep only links starting with http; you can modify this as per requirement
        if aString.startswith("http"):
            print(aString)

In the above code I have provided the URL of our blog, “https://mlforanalytics.com/”. The crawler goes through the home page of our website and prints all the links on that page that start with “http”.
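The “http” filter drops relative links such as “/about”. If you want to keep them, one possible variation is to resolve each href against the page URL first. The sketch below uses only the standard library; the function name absolute_http_links and the sample values are hypothetical and not part of the crawler above.

```python
from urllib.parse import urljoin, urlparse

def absolute_http_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique http(s) links in order."""
    links = []
    for href in hrefs:
        if not href:
            # Skip None and empty strings
            continue
        # Turn relative links like "/about" into full URLs
        full = urljoin(base_url, href)
        # Keep only web links and drop duplicates
        if urlparse(full).scheme in ("http", "https") and full not in links:
            links.append(full)
    return links

# Made-up href values of the kind a page might contain
sample = ["/about", "https://example.com/blog", None,
          "mailto:hi@example.com", "/about"]
print(absolute_http_links("https://example.com/", sample))
# prints ['https://example.com/about', 'https://example.com/blog']
```

In the crawler, the list passed in would be the values returned by link.get('href') for each anchor tag.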

In our next tutorial we will make some major changes to our crawler, such as navigating to other pages through the links found on the first page, saving all the links found on those pages, and more. So stay tuned and keep learning!!
