Web Crawler – Part 1
Hi everyone! Today we will learn about web crawlers. A web crawler is a program that fetches a web page and extracts information from it. In this basic crawler we will extract all the links present on a web page. To implement it we will use the “requests” library to download the page and a Python library called “BeautifulSoup” to parse it.
Library for Web crawler: BeautifulSoup
BeautifulSoup is a Python library that simplifies the extraction of data from HTML and XML files. It automatically converts incoming documents to Unicode and outgoing documents to UTF-8. It provides many methods that make it easy to navigate and search the parsed markup of a web page. Note: if you create a BeautifulSoup object in Python 3 without naming a parser, you may see a warning. The warning does not stop your code from running, but to avoid it, pass a parser name such as “lxml” as the second argument to the BeautifulSoup constructor. The lxml package has to be installed separately.
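To make the parser argument and the Unicode/UTF-8 behaviour concrete, here is a minimal sketch. The sample markup and the variable names (raw, parser_used, text, encoded) are made up for illustration, and the fallback to Python's built-in "html.parser" is only an assumption so the snippet still runs when lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample document, supplied as UTF-8 encoded bytes.
raw = "<p>caf\u00e9</p>".encode("utf-8")

try:
    # Name the parser explicitly so BeautifulSoup does not have to guess.
    soup = BeautifulSoup(raw, "lxml", from_encoding="utf-8")
    parser_used = "lxml"
except FeatureNotFound:
    # lxml is a separate package; fall back to the built-in parser if it is missing.
    soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
    parser_used = "html.parser"

text = soup.p.get_text()   # incoming bytes are decoded to a Unicode str
encoded = soup.p.encode()  # outgoing markup is encoded as UTF-8 bytes by default

print(parser_used, repr(text), encoded)
```

Here text is an ordinary Python string, while encoded is a bytes object holding the UTF-8 form of the same paragraph.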
The code for the basic web crawler is as follows:
[sourcecode language="python" wraplines="false" collapse="false"]
from bs4 import BeautifulSoup
import requests

url = "https://atomic-temporary-144721188.wpcomstaging.com"
r = requests.get(url)  # download the page
data = r.text  # response body as text
soup = BeautifulSoup(data, "lxml")  # parse the HTML with the lxml parser

for link in soup.find_all('a'):
    aString = link.get('href')
    # print(aString)  # Uncomment this line to see every href value extracted from the page
    # Some <a> tags have no href attribute, in which case link.get() returns None
    if aString is not None:
        # Keep only links starting with "http"; modify this check as required
        if aString.startswith("http"):
            print(aString)
[/sourcecode]
In the above code I have provided the URL of our blog, “https://atomic-temporary-144721188.wpcomstaging.com/”. The crawler fetches the home page of our website, goes through every link on that page, and prints the ones that start with “http”.
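If you want to try the filtering logic without making a network request, the following sketch runs the same loop over a small hard-coded page. The SAMPLE_HTML string and the extract_http_links function are hypothetical names introduced only for this example, and it uses Python's built-in "html.parser" so that it works without lxml; you can swap in "lxml" to match the crawler above.

```python
from bs4 import BeautifulSoup

# Hypothetical page used instead of a live website.
SAMPLE_HTML = """
<html><body>
  <a href="https://example.com/about">About</a>
  <a href="/contact">Contact</a>
  <a name="top">Anchor without href</a>
  <a href="http://example.org/">External</a>
  <a href="mailto:someone@example.com">Mail</a>
</body></html>
"""

def extract_http_links(html):
    """Return the href values of all <a> tags that start with "http"."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for link in soup.find_all("a"):
        href = link.get("href")  # None when the tag has no href attribute
        if href is not None and href.startswith("http"):
            links.append(href)
    return links

links = extract_http_links(SAMPLE_HTML)
print(links)
# prints ['https://example.com/about', 'http://example.org/']
```

The relative link, the anchor without an href, and the mailto link are all skipped, which is the same behaviour as the two if statements in the crawler.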
In our next tutorial we will make some major changes to our crawler, such as following the links found on the first page, saving all the links found on those pages, and much more. So stay tuned and keep learning!