Web Scraping with Python

What is Web Scraping?

Web scraping is the extraction of data from a website. The collected information is then saved in files such as .csv, .json, or .xlsm (Excel).

Libraries we will use

We will use two libraries to run our code successfully. Those two libraries are:

  • Requests
  • Beautifulsoup4

We will be using these two to get started with web scraping.

Use of the Requests module

The Requests module in Python is used to send all kinds of HTTP requests. It is very easy to use: give it a URL and it hands you plenty of data on command, as you will see shortly.
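As a small taste of the API before we make a real call, Requests lets you build a request object and inspect it without sending anything over the network (the header name and value below are just illustrative):

```python
import requests

# Build a GET request locally (nothing is sent over the network),
# just to show that a request is defined by its method, URL, and headers.
req = requests.Request("GET", "http://toscrape.com/",
                       headers={"User-Agent": "my-first-scraper/0.1"})
prepared = req.prepare()
print(prepared.method)  # GET
print(prepared.url)     # http://toscrape.com/
```

In practice you will rarely build requests by hand like this; `requests.get(url)` does all of it in one step, as we will see below.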

Use of the Beautifulsoup module

The Beautifulsoup module is responsible for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. We will see it in action shortly.
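To see what "navigating the parse tree" means without touching the network, here is a tiny sketch that parses a made-up HTML string directly:

```python
from bs4 import BeautifulSoup

# A tiny made-up HTML snippet, parsed offline, just to show the idea
html = "<html><body><h1>Hello</h1><p>First</p><p>Second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.string)           # Hello — jump straight to a tag by name
print(len(soup.find_all("p")))  # 2 — find every <p> tag in the document
```

We will use these same two moves, tag access and find_all, on a real page below.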

Time to learn code

We will use http://toscrape.com as our practice site for web scraping. You can search online for more sites to practice on.

Here we go, the first piece of code:

import requests

# Send a GET request and read the HTTP status code of the response
reso = requests.get("http://toscrape.com/")
x = reso.status_code
print(x)
Output
200
When you get 200 as the output, it means your request succeeded.
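If you get something other than 200 (for example 404 for a missing page), the request failed. Requests even ships named constants for these codes, so you can compare against readable names instead of bare numbers:

```python
import requests

# Named constants for common HTTP status codes
print(requests.codes.ok)         # 200 — success
print(requests.codes.not_found)  # 404 — the page does not exist
```

After a real request you can also call `reso.raise_for_status()`, which raises an exception whenever the response is a 4xx or 5xx error.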

If you are facing any problem, ask your questions in the comments below.

Let’s push ourselves further for the second step. Now we will use Beautifulsoup4. Here we go,

import requests
from bs4 import BeautifulSoup

reso = requests.get("http://toscrape.com/")
# Parse the raw HTML of the response with Python's built-in parser
soup = BeautifulSoup(reso.content, "html.parser")
print(soup)
Output
<!DOCTYPE html>

<html lang="en">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Scraping Sandbox</title>
<link href="./css/bootstrap.min.css" rel="stylesheet"/>
<link href="./css/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row">
<div class="col-md-1"></div>
<div class="col-md-10 well">
<img class="logo" src="img/zyte.png" width="200px"/>
<h1 class="text-right">Web Scraping Sandbox</h1>
</div>
</div>
<div class="row">
<div class="col-md-1"></div>
<div class="col-md-10">
<h2>Books</h2>
<p>A <a href="http://books.toscrape.com">fictional bookstore</a> that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well. Available at: <a href="http://books.toscrape.com">books.toscrape.com</a></p>
<div class="col-md-6">
<a href="http://books.toscrape.com"><img class="img-thumbnail" src="./img/books.png"/></a>
</div>
<div class="col-md-6">
<table class="table table-hover">
<tr><th colspan="2">Details</th></tr>
<tr><td>Amount of items </td><td>1000</td></tr>
<tr><td>Pagination </td><td>✔</td></tr>
<tr><td>Items per page </td><td>max 20</td></tr>
<tr><td>Requires JavaScript </td><td>✘</td></tr>
</table>
</div>
</div>
</div>
<div class="row">
<div class="col-md-1"></div>
<div class="col-md-10">
<h2>Quotes</h2>
<p><a href="http://quotes.toscrape.com/">A website</a> that lists quotes from famous people. It has many endpoints showing the quotes in many different ways, each of them including new scraping challenges for you, as described below.</p>
<div class="col-md-6">
<a href="http://quotes.toscrape.com"><img class="img-thumbnail" src="./img/quotes.png"/></a>
</div>
<div class="col-md-6">
<table class="table table-hover">
<tr><th colspan="2">Endpoints</th></tr>
<tr><td><a href="http://quotes.toscrape.com/">Default</a></td><td>Microdata and pagination</td></tr>
<tr><td><a href="http://quotes.toscrape.com/scroll">Scroll</a> </td><td>infinite scrolling pagination</td></tr>
<tr><td><a href="http://quotes.toscrape.com/js">JavaScript</a> </td><td>JavaScript generated content</td></tr>
<tr><td><a href="http://quotes.toscrape.com/js-delayed">Delayed</a> </td><td>Same as JavaScript but with a delay (?delay=10000)</td></tr>
<tr><td><a href="http://quotes.toscrape.com/tableful">Tableful</a> </td><td>a table based messed-up layout</td></tr>
<tr><td><a href="http://quotes.toscrape.com/login">Login</a> </td><td>login with CSRF token (any user/passwd works)</td></tr>
<tr><td><a href="http://quotes.toscrape.com/search.aspx">ViewState</a> </td><td>an AJAX based filter form with ViewStates</td></tr>
<tr><td><a href="http://quotes.toscrape.com/random">Random</a> </td><td>a single random quote</td></tr>
</table>
</div>
</div>
</div>
</div>
</body>
</html>

Here we got the HTML code of the site. I know you can also get it in Chrome or other browsers by right-clicking anywhere on a page and choosing “View page source”. But wait, it’s not over yet. We can have more fun than Chrome or other browsers will give us.

Let’s move further,

import requests
from bs4 import BeautifulSoup
reso = requests.get("http://toscrape.com/")
soup = BeautifulSoup(reso.content,"html.parser")
print(soup.title)
Output
<title>Scraping Sandbox</title>

Here we made one small change: soup.title, which gives the output “<title>Scraping Sandbox</title>”. The <title> tag is what gives any site its title.
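This dot access works for any tag name, not just <title>, and it always returns the first matching tag in the document. A quick offline sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

# Two <h1> tags and a link — dot access picks only the FIRST match
html = "<h1>First</h1><h1>Second</h1><a href='/books'>Books</a>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1)         # <h1>First</h1>
print(soup.a["href"])  # /books — attributes are read like dictionary keys
```

When you need every match instead of just the first one, use find_all, which we will meet in a moment.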

But now I don’t want those tags in my beautiful output, so what to do? Is there any other way to do that? Of course there is; Beautifulsoup provides that feature too. Here we go,

import requests
from bs4 import BeautifulSoup
reso = requests.get("http://toscrape.com/")
soup = BeautifulSoup(reso.content,"html.parser")
print(soup.title.string)
Output
Scraping Sandbox

And boom, no unnecessary HTML tags, just by adding .string after calling title on the soup variable.
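One caveat worth knowing: .string only works when a tag has a single piece of text inside it. If the tag contains nested tags, .string returns None, and get_text() is the safer choice. A small offline sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

# A tag with mixed children: text plus a nested <a> tag
html = "<p>A <a href='#'>link</a> inside</p>"
soup = BeautifulSoup(html, "html.parser")

print(soup.p.string)      # None — .string gives up on mixed children
print(soup.p.get_text())  # A link inside — get_text() concatenates everything
```

So .string is convenient for simple tags like <title>, while get_text() handles tags with markup nested inside them.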

Now let’s get all the information inside the paragraph tags (<p>). Here we go,

import requests
from bs4 import BeautifulSoup
reso = requests.get("http://toscrape.com/")
soup = BeautifulSoup(reso.content,"html.parser")
print(soup.find_all("p"))

Output

[<p>A <a href="http://books.toscrape.com">fictional bookstore</a> that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well. Available at: <a href="http://books.toscrape.com">books.toscrape.com</a></p>, <p><a href="http://quotes.toscrape.com/">A website</a> that lists quotes from famous people. It has many endpoints showing the quotes in many different ways, each of them including new scraping challenges for you, as described below.</p>]
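Since find_all returns a list, we can loop over it and dig into each paragraph, for example to pull out the link each one contains. Here is a sketch using a trimmed offline copy of those two paragraphs:

```python
from bs4 import BeautifulSoup

# A trimmed offline copy of the two paragraphs, for illustration
html = """
<p>A <a href="http://books.toscrape.com">fictional bookstore</a> that wants to be scraped.</p>
<p><a href="http://quotes.toscrape.com/">A website</a> that lists quotes.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Print the first link inside each paragraph
for p in soup.find_all("p"):
    print(p.find("a")["href"])
```

This prints http://books.toscrape.com and then http://quotes.toscrape.com/, one per paragraph.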

So far we have been extracting small pieces of data. Usually the actual data lives in a table (<table>), so let’s extract the data from the table now. Here we go,

import requests
from bs4 import BeautifulSoup

reso = requests.get("http://toscrape.com/")
soup = BeautifulSoup(reso.content,"html.parser")
print(soup.find("table"))

Output

<table class="table table-hover">
<tr><th colspan="2">Details</th></tr>
<tr><td>Amount of items </td><td>1000</td></tr>
<tr><td>Pagination </td><td>✔</td></tr>
<tr><td>Items per page </td><td>max 20</td></tr>
<tr><td>Requires JavaScript </td><td>✘</td></tr>
</table>

Here we can see the actual data that is present in the table. But I am still not satisfied, guys 🙁

I want the data that sits in the table data cells (<td>). Is it possible? Of course it is, since Beautifulsoup is with us, but now we have to use a for loop. Let’s move ahead,

import requests
from bs4 import BeautifulSoup

reso = requests.get("http://toscrape.com/")
soup = BeautifulSoup(reso.content,"html.parser")

# Grab the first <table> on the page
table = soup.find("table")

# Collect every <td> cell and print its text, one per line
table_data = table.find_all("td")
for td in table_data:
    print(td.get_text())

Output

Amount of items 
1000
Pagination 
✔
Items per page 
max 20
Requires JavaScript 
✘
This is all you have to do in the beginning. Don’t worry, I will come up with advanced web scraping too. Stay tuned, turn on notifications to read new and fresh articles, or check out my Instagram page.

Meet Nakum
