Web scraping (also known as web harvesting or web data extraction) is a technique used to extract information and data from websites. In this tutorial we are going to use a Python module called BeautifulSoup for web scraping. It is a very powerful Python library for extracting data from HTML and XML files. Note, however, that BeautifulSoup does not send any page requests to the website itself, so we have to do that using other modules like urllib2, requests etc. First we need to install the BeautifulSoup library. To install the BeautifulSoup module, use the below command :
sudo apt-get install python-bs4
Or you can also install it with pip :
pip install beautifulsoup4
Or, to install BeautifulSoup from source, download the source from here : https://www.crummy.com/software/BeautifulSoup/bs4/download/4.6/
Then extract it and run the setup.py file :
tar xvf beautifulsoup4-x.x.tar.gz
python setup.py install
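To check that the installation worked, you can import the module and print its version (the exact version number will depend on which release you installed) :
$ python -c "import bs4; print bs4.__version__"
4.6.0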
Installing HTML Parser :
Beautiful Soup needs an HTML parser to process the data. Python comes with a built-in HTML parser ('html.parser'), but we can also install more powerful parsers like lxml and html5lib. To install these parsers, use the below commands :
sudo apt-get install python-lxml
sudo apt-get install python-html5lib
Or with pip :
pip install lxml
pip install html5lib
The lxml parser is very fast and can be used to quickly parse the given HTML. The html5lib parser is a bit slower compared to lxml, but it is also a very useful parser. To see the difference between the parsers, try the below code :
$ python
>>> from bs4 import BeautifulSoup
>>> code = """<html>
<HEAD>
<title>This is test
</HEAD>
<body>
<p>Hello world This is test</p>
</html>"""
html.parser :
>>> psr1 = BeautifulSoup(code, 'html.parser')
>>> print psr1
<html>
<head>
<title>This is test
</title></head>
<body>
<p>Hello world This is test</p>
</body></html>
xml parser :
>>> psr2 = BeautifulSoup(code, 'xml')
>>> print psr2
<?xml version="1.0" encoding="utf-8"?>
<html>
<HEAD>
<title>This is test
</title>
<body>
<p>Hello world This is test</p>
</body></HEAD></html>
lxml parser :
>>> psr3 = BeautifulSoup(code, 'lxml')
>>> print psr3
<html>
<head>
<title>This is test
</title></head>
<body>
<p>Hello world This is test</p>
</body></html>
html5lib parser : The html5lib parser builds the page the way a web browser does, always producing a complete, valid HTML5 document (it fills in any tags the markup leaves out). It is the most lenient of the parsers, but also the slowest.

Basics of BeautifulSoup :
Here are some basic examples of how to use the BeautifulSoup library for web scraping. We will use the following simple HTML page for demonstration :
<html>
<head>
<title>Simple Web Page</title>
</head>
<body>
<p class='heading'>Hello world : A Sample Page test</p>
<p class='sub-heading'>This the Sub-heading or Description</p>
<table>
<tr>
<th>Firstname</th><th>Lastname</th> <th>Age</th>
</tr>
<tr>
<td>Frank</td><td>Castle</td><td>40</td>
</tr>
<tr>
<td>Jack</td><td>Rietcher</td><td>45</td>
</tr>
</table>
<a href="http://wikipedia.org/">Wikipedia</a>
<a href="http://youtube.com">Watch Videos</a>
<a href="http://google.com">Search Something</a>
<a href="http://facebook.com">Find Your Friends</a>
</body>
</html>
Store the above code in a variable in the Python interpreter :
code = """<html>
<head>
<title>Simple Web Page</title>
</head>
<body>
<p class='heading'>Hello world : A Sample Page test</p>
<p class='sub-heading'>This the Sub-heading or Description</p>
<table>
<tr>
<th>Firstname</th><th>Lastname</th> <th>Age</th>
</tr>
<tr>
<td>Frank</td><td>Castle</td><td>40</td>
</tr>
<tr>
<td>Jack</td><td>Rietcher</td><td>45</td>
</tr>
</table>
<a href="http://wikipedia.org/">Wikipedia</a>
<a href="http://youtube.com">Watch Videos</a>
<a href="http://google.com">Search Something</a>
<a href="http://facebook.com">Find Your Friends</a>
</body>
</html>"""
Now import the BeautifulSoup library :
from bs4 import BeautifulSoup
Now create a bs4 object to parse the data; we are going to use the lxml parser in this example :
soup = BeautifulSoup(code, 'lxml')
Now with the soup object we can access all the elements of the HTML page through tag names and their attributes. For example, to get the title of the page :
>>> soup.title
<title>Simple Web Page</title>
>>> soup.title.string
u'Simple Web Page'
>>> soup.title.name
'title'
To get the first paragraph :
>>> soup.p
<p class="heading">Hello world : A Sample Page test</p>
>>> print soup.p.contents
[u'Hello world : A Sample Page test']
>>> soup.p.string
u'Hello world : A Sample Page test'
To get all paragraphs :
>>> soup.find_all('p')
[<p class="heading">Hello world : A Sample Page test</p>, <p class="sub-heading">This the Sub-heading or Description</p>]
Extracting table information :
>>> soup.body.table
<table>\n<tr>\n<th>Firstname</th><th>Lastname</th> <th>Age</th>\n</tr>\n<tr>\n<td>Frank</td><td>Castle</td><td>40</td>
\n</tr>\n<tr>\n<td>Jack</td><td>Rietcher</td><td>45</td>\n</tr>\n</table>
>>> soup.body.table.tr
<tr>\n<th>Firstname</th><th>Lastname</th> <th>Age</th>\n</tr>
>>> soup.body.table.tr.th
<th>Firstname</th>
>>> soup.body.table.tr.find_all('th')
[<th>Firstname</th>, <th>Lastname</th>, <th>Age</th>]
Printing all the row data in the table :
>>> for dat in soup.body.table.find_all('tr'):
... for ind in dat.find_all('td'):
... print ind.string
...
Frank
Castle
40
Jack
Rietcher
45
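If we want the table as a Python data structure instead of just printed text, we can collect the cell strings row by row. A minimal sketch (the output shown is what we would expect for the sample page) :
>>> rows = []
>>> for tr in soup.body.table.find_all('tr'):
...     rows.append([cell.string for cell in tr.find_all(['th', 'td'])])
...
>>> rows
[[u'Firstname', u'Lastname', u'Age'], [u'Frank', u'Castle', u'40'], [u'Jack', u'Rietcher', u'45']]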
Getting link information :
>>> soup.body.a
<a href="http://wikipedia.org/">Wikipedia</a>
>>> soup.body.find_all('a')
[<a href="http://wikipedia.org/">Wikipedia</a>, <a href="http://youtube.com">Watch Videos</a>, <a href="http://google.com">Search Something</a>,
<a href="http://facebook.com">Find Your Friends</a>]
Print all the links :
>>> for link in soup.body.find_all('a'):
... print link['href']
...
http://wikipedia.org/
http://youtube.com
http://google.com
http://facebook.com
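We can also print the visible text of each link together with its target; for the sample page this should give :
>>> for link in soup.body.find_all('a'):
...     print link.string, ':', link['href']
...
Wikipedia : http://wikipedia.org/
Watch Videos : http://youtube.com
Search Something : http://google.com
Find Your Friends : http://facebook.com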
With next_sibling and previous_sibling we can navigate between page elements that are on the same level :
>>> soup.body.p
<p class="heading">Hello world : A Sample Page test</p>
>>> soup.body.p.next_sibling
u'\n'
>>> soup.body.p.next_sibling.next_sibling
<p class="sub-heading">This the Sub-heading or Description</p>
Parsing a Webpage with urllib2 and BeautifulSoup :
First import all the necessary libraries :
from bs4 import BeautifulSoup
import urllib2
Now to get the page, send a GET request to the page URL https://www.w3.org and parse the response with BeautifulSoup :
page = urllib2.urlopen("https://www.w3.org")
soup = BeautifulSoup(page, 'lxml')
Now get the title of the page :
>>> soup.title.string
u'World Wide Web Consortium (W3C)'
>>> soup.body.p
<p class="bct"><span class="skip"><a accesskey="2" href="#w3c_most-recently" tabindex="1" title="Skip to content (e.g., when browsing via audio)">Skip</a></span></p>
Harvesting all the links :
>>> for link in soup.find_all('a'):
... print link['href']
...
/
/standards/
/participate/
------------
-----------
https://www.w3.org/WAI/videos/standards-and-benefits.html
https://www.w3.org/WAI/videos/standards-and-benefits.html
http://lists.w3.org/Archives/Public/site-comments/
http://twitter.com/W3C
http://www.csail.mit.edu/
http://www.ercim.eu/
http://www.keio.ac.jp/
http://ev.buaa.edu.cn/
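Notice that some of the harvested hrefs are relative paths (such as /standards/). Before following them it is useful to turn them into absolute URLs, for example with urlparse.urljoin; a small sketch :
>>> import urlparse
>>> for link in soup.find_all('a'):
...     href = link.get('href')
...     if href:
...         print urlparse.urljoin("https://www.w3.org", href)
...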
Printing all the text from the page :
print(soup.get_text())
And like the above, we can collect information by parsing web pages.

Conclusion :
BeautifulSoup is a very powerful library for parsing HTML and XML documents and collecting data from them. Above we saw some very basic examples of how to use it. For more detailed information, please check the documentation page here : Official Documentation