This article is a detailed explanation of Web Scraping in Python using BeautifulSoup. The prerequisites for this article are Python and Pandas.
Web scraping also requires a little knowledge of HTML, so if you already know it, great; otherwise don’t worry, I’ll cover the required HTML topics.
First, we’ll talk about Web Scraping, then we’ll look into BeautifulSoup, and in the end, we’ll work through an example.
The incredible amount of data on the Internet is a rich resource for any field of research or personal interest. To effectively harvest that data, you’ll need to become skilled at web scraping, and the Python libraries requests and Beautiful Soup are powerful tools for the job.
Why do we need Web Scraping?
Suppose someone asks you to get the list of the Top 100 Movies along with all the details like year, rating, director, and actors of each movie. What would you do?
First, you might search for Top 100 Movies on Google, open the first link (maybe IMDb), and start copy-pasting the list and the details; this seems like a bad idea. What if you had a script or program that takes the URL of the website and extracts all the required information from it?
Similarly, there might be hundreds of websites that have relevant information for you; some of them have static information and some have changing information, like sports sites, news sites, etc. In today’s world, digital information is very important and highly valuable.
Web Scraping provides a way to automate the information extraction from the given website(s).
What is it?
Web Scraping is nothing but automated data extraction from website(s); after extraction, this data is processed and converted into useful information.
Websites are collections of web pages, where each web page is built using text-based markup languages like HTML and XHTML.
Web pages contain useful information in text form, but they are designed for humans as the end users, so special tools are required to automate the information extraction.
A web scraping tool uses the HTML structural elements (div, span, p, a, etc.) and attributes (id, class) of the web page to extract the text information.
Now, before moving on to BeautifulSoup, let’s take a brief look at some HTML basics.
HTML Basics.
In order to extract the information, we first need insight into the structure of the web page; this will tell us which section of the HTML holds which piece of information.
For a better understanding of the web page’s structure, we need to know some basics of HTML.
Elements.
An HTML element consists of 3 parts: a start tag, some content, and an end tag.
There are several types of elements in HTML, each with a different purpose and usage. Each type of element is uniquely identified by its tag name. Elements can also be nested inside one another.
- h1: Used for a heading; it displays the heading in the biggest size. h2, h3, h4, h5, and h6 are the other heading elements.
- p: Used for a paragraph.
- a: Used to provide hyperlinks.
- div: Defines a division or section.
- span: Used for grouping inline elements.
Attributes.
An HTML element may or may not have attributes. Attributes provide additional information about the element. We’ll only talk about two attributes, class and id.
- Class: The HTML class attribute is used to apply the same style to elements with the same class name.
- Id: The id attribute specifies a unique id for an HTML element (the value must be unique within the HTML document).
I have tried to give a brief overview of some components of HTML; if you still have doubts or want to explore more, then check out w3schools.
Ok, so let’s dive into BeautifulSoup, a beautiful tool that makes web scraping super easy.
BeautifulSoup.
BeautifulSoup is a Python package for parsing HTML and XML documents; it provides Pythonic idioms for iterating, searching, and modifying the parse tree.
It can work with different parsers like html.parser, lxml, and html5lib. BeautifulSoup provides an API to search based on the structural elements and attributes of HTML and XHTML.
Installation.
Since it’s a Python package, the installation is super easy. Install it using pip. I am assuming that you are working in a Python 3 environment.
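For example, in a terminal (requests, which we’ll use later to fetch the pages, can be installed alongside it):

```
pip install beautifulsoup4 requests
```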
Usage.
We’ll only talk about the most common usage.
Soup.
Before we parse the HTML document, we have to create the soup of the given document; it is the base of the entire parsing. The soup can be created by passing the HTML content to the BeautifulSoup constructor.
The HTML content can be passed in multiple ways, such as passing the file pointer of the HTML document (web page) or passing the HTML content as a string.
Passing the file pointer.
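For example, a minimal sketch assuming the web page has been saved locally as page.html (a hypothetical file name):

```python
from bs4 import BeautifulSoup

# "page.html" is a hypothetical local copy of the web page.
with open('page.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')
```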
Passing HTML content as a string.
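And the same soup built from an HTML string:

```python
from bs4 import BeautifulSoup

html_doc = '<html><body><h1>Godfather</h1></body></html>'
soup = BeautifulSoup(html_doc, 'html.parser')
```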
Prettify.
Once the soup is ready, we can display it in a nice and clean way using prettify. prettify maintains the structural hierarchy of the HTML while displaying it.
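For example:

```python
# prettify() returns the parse tree as a nicely indented string.
print(soup.prettify())
```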
If you run the above code, you’ll get a nice HTML output, which helps us to get a better understanding of the structure and hierarchy of HTML.
Find.
Now we are ready to parse the document and extract the data from it. find helps us find HTML elements (div, span, p, a) based on their tag, id, and attributes.
find always returns only the first search result.
Let’s take a small piece of HTML content as an example.
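Here is a small, illustrative snippet; the id and class values are assumptions chosen to match the examples that follow:

```python
from bs4 import BeautifulSoup

# Illustrative content only; the id and class values are assumptions.
html_doc = """
<html>
  <body>
    <h1 id="name">Godfather</h1>
    <p>A short description of the movie.</p>
    <span class="genre">Crime, Drama</span>
  </body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
```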
Now we try to extract different information from the above content using different properties.
Find using tag name.
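For example:

```python
# find returns the first element with the given tag name.
movie_name = soup.find('h1')
print(movie_name.text)   # Godfather
```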
The above code finds the first HTML element with the h1 tag name, and text returns the text part of that element. So the output of the code will be Godfather.
BeautifulSoup provides multiple ways to extract the same information. You can extract the above information using the following method also.
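For example, the first h1 is also available directly as an attribute of the soup:

```python
# Equivalent shortcut: access the tag as an attribute of the soup.
print(soup.h1.text)   # Godfather
```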
Find using id.
It provides the functionality to search for an element by its id. You can extract the movie name from the above example using its id.
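For example, using the id from the sample content above:

```python
# find can match on the id attribute directly.
movie_name = soup.find(id='name')
print(movie_name.text)   # Godfather
```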
Find using attributes.
You can search for an element based on its attributes, like class. We’ll extract the genre from the above example by its class.
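For example:

```python
# class_ is used because "class" is a reserved keyword in Python.
genre = soup.find('span', class_='genre')
print(genre.text)   # Crime, Drama
```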
Find all.
In the above section, we saw that find returns only the first result, but if we want all the elements with a specific property, we can use find_all.
Let’s take another example with multiple movies.
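Another illustrative snippet, this time with three movies (the second and third titles are placeholders, not taken from the original example):

```python
from bs4 import BeautifulSoup

# Three movies, each inside its own h1 element (titles are placeholders).
html_doc = """
<html>
  <body>
    <h1>Godfather</h1>
    <h1>The Shawshank Redemption</h1>
    <h1>The Dark Knight</h1>
  </body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
```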
As you can see, all the movies are in h1 heading elements, so let’s find all the elements with the h1 tag name. The following code will return all 3 results.
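A minimal sketch:

```python
# find_all returns every matching element as a list.
movies = soup.find_all('h1')
print(movies)

# Unlike find, text has to be applied to each result individually.
for movie in movies:
    print(movie.text)
```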
You might have noticed that we didn’t apply text to the results as we did with find; with find_all you have to extract the text from each individual result.
Apart from returning multiple results, find_all works the same way as find.
We have learned some of the basic APIs of BeautifulSoup, but it provides many other APIs for more complex scenarios, so if you are willing to explore then check out the documentation.
Done with the theory? ok, let’s take an example and extract the data from a real website.
For Web Scraping, we need to know the structure of the web page we are dealing with; understanding the HTML structure is the most important part of Web Scraping, so pay extra attention to the next section.
Analysis of the web page.
Since the beginning of the article we have been talking about the Top 100 Movies, so let’s take the same example and extract the movie information from IMDb.
Inspect the web page.
First, open the Top 100 Movies page and analyze the HTML structure of the web page. To analyze any web page we can use the browser’s inspect feature: just right-click on the web page and choose the Inspect option from the menu.
This gives us the general HTML layout of the web page, but if you want to see the structure of a particular element, hover the cursor over that element on the web page and then inspect it.
If you move the cursor in the HTML section, you’ll see different highlighted blocks on the web page, which shows the mapping between each HTML element and its corresponding block.
Hovering over a movie block highlights it on the web page, and above that block you can see div.lister-item-content, which is the HTML element of that block. Here div is the actual element and lister-item-content is the value of its class attribute.
This highlighted block contains all the information about a particular movie (name, year, genre, etc.). There are 100 such blocks on the web page, one block per movie. If you inspect some of them, you’ll find that they have similar HTML elements and attributes.
Finding the relevant HTML elements.
Now let’s explore this block further: just click on div.lister-item-content in the HTML section and you’ll see multiple nested elements. These nested elements hold our information, so to find the HTML element for a given piece of information, just move the cursor over that information on the web page and then inspect it.
Finding the relevant HTML elements is a fairly easy task, so let’s see what we’ve found. Below is the list of movie info and the corresponding elements.
- Movie Name: Element a under the h3.lister-item-header.
- Year: Element span.lister-item-year text-muted unbold under h3.lister-item-header.
- Runtime: Element span.runtime under p.text-muted text-small.
- Genre: Element span.genre under p.text-muted text-small.
- Rating: Element span.ipl-rating-star__rating.
- Director: Element a under second p.text-muted text-small.
- Stars: Element a under second p.text-muted text-small.
Finding the above information about HTML elements is the most important step of web scraping, so before we move further, make sure you understand each and every part of it.
If you have understood the above sections clearly, then the implementation is going to be a piece of cake. So let’s implement Web Scraping in Python using BeautifulSoup.
Implementation in Python
Note: Not all websites allow Web Scraping, so please be cautious before you do it on a given website; it might get your IP blocked from accessing that website.
As discussed, we are going to use Web Scraping to extract the information about the Top 100 Movies, so let’s implement it in Python step by step.
Getting the HTML content.
Here we’ll use Python requests module to fetch the HTML content.
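A minimal sketch of this step; the URL below is just a placeholder for the Top 100 Movies page analyzed above:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL: replace it with the actual Top 100 Movies page analyzed above.
url = 'https://www.imdb.com/...'
r = requests.get(url)

# r.content holds the raw HTML returned by the server.
soup = BeautifulSoup(r.content, 'html.parser')
```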
First, we need the actual URL of the page we want to fetch; that is what we store in the url variable.
Then we send a GET request to that URL, which returns the actual HTML content along with other response-related information like the headers. We can extract the HTML content from the response using r.content.
Once we have the actual HTML content, we create the soup by passing r.content to the BeautifulSoup constructor.
Finding the main blocks.
In the Analysis of the web page section we talked about the main block that holds all the movie-related information; the entire web page has 100 such blocks, one block per movie.
So the first task is to find all the main blocks. As we know from the last section, a main block is a div element with the lister-item-content class attribute, so we can use this information to find all the blocks.
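A minimal sketch of that step:

```python
# Every movie block is a div with the "lister-item-content" class.
blocks = soup.find_all('div', class_='lister-item-content')
print(len(blocks))   # expected: 100, one block per movie
```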
We have used find_all because we want to fetch all the blocks. Now we iterate through all the blocks and extract the information from each block individually.
Extracting information from each block.
This section involves the actual information extraction, so pay extra attention, and if you don’t understand any part of it, please refer to the Analysis of the web page section.
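Here is a sketch of the extraction loop; the variable names are mine and error handling is omitted:

```python
import re

movies = []

for block in blocks:
    # Movie name: the first <a> element inside the block.
    name = block.find('a').text

    # Year: strip the enclosing brackets, e.g. "(1972)" -> "1972".
    year = block.find('span', class_='lister-item-year text-muted unbold').text
    year = year.strip('()')

    # Runtime and genre (genre needs a strip to drop surrounding whitespace).
    runtime = block.find('span', class_='runtime').text
    genre = block.find('span', class_='genre').text.strip()

    # Rating.
    rating = block.find('span', class_='ipl-rating-star__rating').text

    # Directors and Stars: the second <p class="text-muted text-small"> element.
    text = block.find_all('p', class_='text-muted text-small')[1].text
    text = text.replace('\n', '')

    # The Director(s) part and the Star(s) part are separated by '|'.
    people = text.split('|')

    # Director names come after the "Director:" / "Directors:" keyword.
    directors = re.split('Directors:|Director:', people[0])[1]
    directors = [d.strip() for d in directors.split(',')]

    # Exactly the same logic for the Star names.
    stars = re.split('Stars:|Star:', people[1])[1]
    stars = [s.strip() for s in stars.split(',')]

    movies.append([name, year, runtime, genre, rating, directors, stars])
```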
Let’s understand each part of the above code step by step.
Movie Name: Since the movie name comes under the first ‘a’ element, we can extract it using find. text is used to extract the text part of it (discussed earlier).
Year: It comes under a span element with the lister-item-year text-muted unbold class. In the next line, we remove the enclosing brackets.
Runtime: It can be found in a span element with the runtime class.
Genre: We extract it from the span element with the genre class. Once we have extracted it, we apply a string strip to remove the unwanted spaces.
Rating: It comes under the span element with the ipl-rating-star__rating class.
Directors and Stars: All the above parts were easy, but this part is a little tricky. There is more than one ‘p’ element with the text-muted text-small class, but the information about Directors and Stars comes under the second result; that is why we take the result at index 1 (counting starts at 0).
Now we remove all the unwanted ‘\n’ (newline) characters. As you might have noticed, the Director(s) and Star(s) parts are separated by ‘|’, so we apply split to divide the text, which gives the people list. After the split, the first part of people contains the Directors, and the second part contains the Stars.
The names of all the directors come after the Director: or Directors: keyword, therefore we apply re.split (it allows splitting on multiple delimiters). This split gives a list of two elements, as follows:
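A hypothetical illustration of what that split produces (the director names here are placeholders, not the output for the actual first movie):

```python
import re

# people[0] looks something like this after the '|' split (placeholder names).
directors_part = 'Directors: Lana Wachowski, Lilly Wachowski '
print(re.split('Directors:|Director:', directors_part))
# ['', ' Lana Wachowski, Lilly Wachowski ']
```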
As you can see, only the second item (index 1) contains the Director(s); that is why we pick that item, then split it on ‘,’ to get all the director name(s) as a list. After getting the list, we apply a string strip to clean up each name.
We apply exactly the same logic to extract the list of Stars.
Now we have extracted all the required information, which you can save in any format you want.
The next section contains the complete code, and in that code we save the information into a Pandas DataFrame.
Complete Code.
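Below is a consolidated sketch of the whole script; the URL is a placeholder and the DataFrame column names are my own choices:

```python
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Placeholder: replace with the actual IMDb Top 100 Movies page URL.
URL = 'https://www.imdb.com/...'

r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html.parser')

# One block per movie.
blocks = soup.find_all('div', class_='lister-item-content')

rows = []
for block in blocks:
    name = block.find('a').text
    year = block.find('span', class_='lister-item-year text-muted unbold').text.strip('()')
    runtime = block.find('span', class_='runtime').text
    genre = block.find('span', class_='genre').text.strip()
    rating = block.find('span', class_='ipl-rating-star__rating').text

    # Directors and Stars live in the second p.text-muted text-small element.
    text = block.find_all('p', class_='text-muted text-small')[1].text.replace('\n', '')
    people = text.split('|')
    directors = [d.strip() for d in re.split('Directors:|Director:', people[0])[1].split(',')]
    stars = [s.strip() for s in re.split('Stars:|Star:', people[1])[1].split(',')]

    rows.append({'Name': name, 'Year': year, 'Runtime': runtime, 'Genre': genre,
                 'Rating': rating, 'Directors': directors, 'Stars': stars})

# Save everything into a Pandas DataFrame.
df = pd.DataFrame(rows)
print(df.head())
```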
So this is all about Web Scraping in Python using BeautifulSoup; if you have any issues or doubts, please let me know in the comments.
Thanks for reading!