Web scraping (also known as web harvesting or web data extraction) is a technique used to extract information and data from websites. In this tutorial we are going to use a Python module called BeautifulSoup for web scraping. It is a very powerful Python library for extracting data from HTML and XML files. Note, however, that BeautifulSoup does not send any page requests to the website itself, so we have to do that using other modules like urllib2, requests etc. First we need to install the BeautifulSoup library. To install the BeautifulSoup module, use the below command :
sudo apt-get install python-bs4
Or you can also install it with pip :
pip install beautifulsoup4
Or, to install BeautifulSoup from source, download the source from here : https://www.crummy.com/software/BeautifulSoup/bs4/download/4.6/
Then extract it and run the setup.py file :
tar xvf beautifulsoup4-x.x.tar.gz
python setup.py install
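To check that the installation worked, you can import the module and print its version (the exact version number will depend on which release you installed) :
$ python -c "import bs4; print bs4.__version__"
4.6.0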
Installing HTML Parser :
Beautiful Soup needs an HTML parser to process the data. Python comes with a built-in HTML parser ('html.parser'), but we can also install more powerful parsers like lxml and html5lib. To install these parsers, use the below commands :
sudo apt-get install python-lxml
sudo apt-get install python-html5lib
Or with pip :
pip install lxml
pip install html5lib
The lxml parser is very fast and can be used to quickly parse the given HTML. The html5lib parser is a bit slower compared to lxml, but it is also a very useful parser. To see the difference between the parsers, try the below code :
$ python
>>> from bs4 import BeautifulSoup
>>> code = """<html>
<HEAD>
<title>This is test
</HEAD>
<body>
<p>Hello world This is test</p>
</html>"""
html.parser :
>>> psr1 = BeautifulSoup(code, 'html.parser')
>>> print psr1
<html>
<head>
<title>This is test
</title></head>
<body>
<p>Hello world This is test</p>
</body></html>
xml parser :
>>> psr2 = BeautifulSoup(code, 'xml')
>>> print psr2
<?xml version="1.0" encoding="utf-8"?>
<html>
<HEAD>
<title>This is test
</title>
<body>
<p>Hello world This is test</p>
</body></HEAD></html>
lxml parser :
>>> psr3 = BeautifulSoup(code, 'lxml')
>>> print psr3
<html>
<head>
<title>This is test
</title></head>
<body>
<p>Hello world This is test</p>
</body></html>
html5lib parser : The html5lib parser builds the page the way a web browser does, always producing a complete, valid HTML5 document (it fills in any tags the markup leaves out). It is the most lenient of the parsers, but also the slowest.

Basics of BeautifulSoup :
Here are some basic examples of how to use the BeautifulSoup library for web scraping. We will use the following simple HTML page for demonstration :
<html>
<head>
<title>Simple Web Page</title>
</head>
<body>
<p class='heading'>Hello world : A Sample Page test</p>
<p class='sub-heading'>This the Sub-heading or Description</p>
<table>
<tr>
<th>Firstname</th><th>Lastname</th> <th>Age</th>
</tr>
<tr>
<td>Frank</td><td>Castle</td><td>40</td>
</tr>
<tr>
<td>Jack</td><td>Rietcher</td><td>45</td>
</tr>
</table>
<a href="http://wikipedia.org/">Wikipedia</a>
<a href="http://youtube.com">Watch Videos</a>
<a href="http://google.com">Search Something</a>
<a href="http://facebook.com">Find Your Friends</a>
</body>
</html>
Store the above code in a variable in the Python interpreter :
code = """<html>
<head>
<title>Simple Web Page</title>
</head>
<body>
<p class='heading'>Hello world : A Sample Page test</p>
<p class='sub-heading'>This the Sub-heading or Description</p>
<table>
<tr>
<th>Firstname</th><th>Lastname</th> <th>Age</th>
</tr>
<tr>
<td>Frank</td><td>Castle</td><td>40</td>
</tr>
<tr>
<td>Jack</td><td>Rietcher</td><td>45</td>
</tr>
</table>
<a href="http://wikipedia.org/">Wikipedia</a>
<a href="http://youtube.com">Watch Videos</a>
<a href="http://google.com">Search Something</a>
<a href="http://facebook.com">Find Your Friends</a>
</body>
</html>"""
Now import the BeautifulSoup library :
from bs4 import BeautifulSoup
Now create a bs4 object to parse the data; we are going to use the lxml parser in this example :
soup = BeautifulSoup(code, 'lxml')
Now with the soup object we can access all the elements of the HTML page through tag names and their attributes. For example, to get the title of the page :
>>> soup.title
<title>Simple Web Page</title>
>>> soup.title.string
u'Simple Web Page'
>>> soup.title.name
'title'
To get the first paragraph :
>>> soup.p
<p class="heading">Hello world : A Sample Page test</p>
>>> print soup.p.contents
[u'Hello world : A Sample Page test']
>>> soup.p.string
u'Hello world : A Sample Page test'
To get all paragraphs :
>>> soup.find_all('p')
[<p class="heading">Hello world : A Sample Page test</p>, <p class="sub-heading">This the Sub-heading or Description</p>]
Extracting table information :
>>> soup.body.table
<table>\n<tr>\n<th>Firstname</th><th>Lastname</th> <th>Age</th>\n</tr>\n<tr>\n<td>Frank</td><td>Castle</td><td>40</td>
\n</tr>\n<tr>\n<td>Jack</td><td>Rietcher</td><td>45</td>\n</tr>\n</table>
>>> soup.body.table.tr
<tr>\n<th>Firstname</th><th>Lastname</th> <th>Age</th>\n</tr>
>>> soup.body.table.tr.th
<th>Firstname</th>
>>> soup.body.table.tr.find_all('th')
[<th>Firstname</th>, <th>Lastname</th>, <th>Age</th>]
Printing all the row data in the table :
>>> for dat in soup.body.table.find_all('tr'):
... for ind in dat.find_all('td'):
... print ind.string
...
Frank
Castle
40
Jack
Rietcher
45
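If we want the table as a Python data structure instead of just printed text, we can collect the cell strings row by row. A minimal sketch (the output shown is what we would expect for the sample page) :
>>> rows = []
>>> for tr in soup.body.table.find_all('tr'):
...     rows.append([cell.string for cell in tr.find_all(['th', 'td'])])
...
>>> rows
[[u'Firstname', u'Lastname', u'Age'], [u'Frank', u'Castle', u'40'], [u'Jack', u'Rietcher', u'45']]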
Getting link information :
>>> soup.body.a
<a href="http://wikipedia.org/">Wikipedia</a>
>>> soup.body.find_all('a')
[<a href="http://wikipedia.org/">Wikipedia</a>, <a href="http://youtube.com">Watch Videos</a>, <a href="http://google.com">Search Something</a>,
<a href="http://facebook.com">Find Your Friends</a>]
Print all the links :
>>> for link in soup.body.find_all('a'):
... print link['href']
...
http://wikipedia.org/
http://youtube.com
http://google.com
http://facebook.com
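We can also print the visible text of each link together with its target; for the sample page this should give :
>>> for link in soup.body.find_all('a'):
...     print link.string, ':', link['href']
...
Wikipedia : http://wikipedia.org/
Watch Videos : http://youtube.com
Search Something : http://google.com
Find Your Friends : http://facebook.com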
With next_sibling and previous_sibling we can navigate between page elements that are on the same level :
>>> soup.body.p
<p class="heading">Hello world : A Sample Page test</p>
>>> soup.body.p.next_sibling
u'\n'
>>> soup.body.p.next_sibling.next_sibling
<p class="sub-heading">This the Sub-heading or Description</p>
Parsing a Webpage with urllib2 and BeautifulSoup :
First import all the necessary libraries :
from bs4 import BeautifulSoup
import urllib2
Now to get the page, send a GET request to the page URL https://www.w3.org and parse the response with BeautifulSoup :
page = urllib2.urlopen("https://www.w3.org")
soup = BeautifulSoup(page, 'lxml')
Now get the title of the page :
>>> soup.title.string
u'World Wide Web Consortium (W3C)'
>>> soup.body.p
<p class="bct"><span class="skip"><a accesskey="2" href="#w3c_most-recently" tabindex="1" title="Skip to content (e.g., when browsing via audio)">Skip</a></span></p>
Harvesting all the links :
>>> for link in soup.find_all('a'):
... print link['href']
...
/
/standards/
/participate/
------------
-----------
https://www.w3.org/WAI/videos/standards-and-benefits.html
https://www.w3.org/WAI/videos/standards-and-benefits.html
http://lists.w3.org/Archives/Public/site-comments/
http://twitter.com/W3C
http://www.csail.mit.edu/
http://www.ercim.eu/
http://www.keio.ac.jp/
http://ev.buaa.edu.cn/
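Notice that some of the harvested hrefs are relative paths (such as /standards/). Before following them it is useful to turn them into absolute URLs, for example with urlparse.urljoin; a small sketch :
>>> import urlparse
>>> for link in soup.find_all('a'):
...     href = link.get('href')
...     if href:
...         print urlparse.urljoin("https://www.w3.org", href)
...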
Printing all the text from the page :
print(soup.get_text())
And like the above, we can collect information by parsing web pages.

Conclusion :
BeautifulSoup is a very powerful library for parsing HTML and XML documents and collecting data from them. Above we saw some very basic examples of how to use it. For more detailed information, please check the documentation page here : Official Documentation