A Sip of BeautifulSoup

Rumjot Kaur
4 min read · May 18, 2021


Relax and take a sip

BeautifulSoup is, in my opinion, one of the most interesting and fascinating Python packages. It is used to parse HTML and XML documents. In plain English, it is used to extract data from HTML and XML documents.

BeautifulSoup has been saving programmers' time and effort since 2004. It is used for web scraping with Python. Web scraping in itself is an interesting topic to discuss.

NOTE: You can't scrape just any website; you need permission from the owner. Scraping a site without that permission may be considered illegal.

In this blog, I will demonstrate a simple program that will give you a basic idea of how BeautifulSoup works. We will extract data from a site called quotes.toscrape.com.

OUR TASK IS TO GET THE NAMES OF PROMINENT AUTHORS ON THE FIRST PAGE.

So, let us get started and take a sip of BeautifulSoup.

(I am using a Jupyter notebook and BeautifulSoup version 4 for today's demo.)

STEP 1:

You need to set up the environment, i.e., install the required libraries for Python.

Run the statements below in your Anaconda prompt:

1. pip install requests

The requests library fetches the website's page for us so that we can extract data from it.

2. pip install lxml

lxml is the parser that BeautifulSoup will use behind the scenes.

3. pip install bs4 (BeautifulSoup version 4), our main library.

Once they are installed, import them into your Python script.
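A minimal version of those imports looks like this (I am importing bs4 as a whole; you could equally write from bs4 import BeautifulSoup):

import requests
import bs4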

STEP 2:

Get the URL of the website you want to scrape. For that, use the functions shown below.
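Here is a small sketch of that request; result is simply the variable name I am using for the response, and the URL points at the first page of quotes.toscrape.com:

result = requests.get("http://quotes.toscrape.com/page/1/")
print(result.text)  # the page's HTML as one giant Python string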

If the website doesn't permit us to scrape it, the second statement, i.e. result.text, will show an error instead of the page we want.

In the get() function, we pass the URL of the page we want to scrape; in our case, it is the first page of quotes.toscrape.com. get() returns a response whose text attribute holds the HTML code as a giant Python string.

STEP 3:

To parse through the Python string, we will use the BeautifulSoup library, which on the back end will use lxml. With the help of lxml, BeautifulSoup understands the structure of the HTML code.
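A sketch of that step, reusing result from Step 2; soup is just the name I chose for the parsed document:

soup = bs4.BeautifulSoup(result.text, "lxml")
print(soup.prettify())  # the same HTML, neatly indented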

The output of the print statement above will be the page's HTML, neatly indented.

Now, this looks familiar, doesn't it? Thanks to BeautifulSoup.

STEP 4:

To get the names of the authors on the first page, look through the HTML code and find the tag in which each author's name is mentioned.

On a careful read through the code, you will find a tag called "small" which has the class name "author".

You will notice that this class appears around every author's name. Hence, we will select on this class name, as shown below.
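A sketch of that selection; cup is the variable name used in this post, and ".author" is the CSS selector for the class:

cup = soup.select(".author")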

NOTE: We can use element tags inside select() too. But when the page has a lot of similar tags, selecting by tag alone becomes complicated, so selecting by class name is cleaner.

The value of cup will look something like this:
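Here is roughly what the first couple of entries look like; the exact attributes depend on the page's markup:

print(cup[:2])
# [<small class="author" itemprop="author">Albert Einstein</small>,
#  <small class="author" itemprop="author">J.K. Rowling</small>]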

As you can see, it is of the data type LIST, with a lot of extra information. If you have worked with HTML, you will notice that the only text in each element is the author's name; the rest is tag markup.

Thus, we will iterate through this list and extract that text using the get_text() function.
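A minimal sketch of that loop; collecting the names into a set is my choice, so that authors who appear more than once show up only once:

authors = set()
for tag in cup:
    authors.add(tag.get_text())
print(authors)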

Our final result will be the set of author names from the first page, something along these lines:
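(The exact names depend on the site's current content; this is roughly what page one returns.)

{'Albert Einstein', 'J.K. Rowling', 'Jane Austen', 'Marilyn Monroe', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin'}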

Similarly, if you want to get the author names on the next page, the URL of that page will be: http://quotes.toscrape.com/page/2/
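If you want to collect authors from several pages, here is a small sketch that assumes the page URLs simply increment, which is how quotes.toscrape.com is laid out:

for page in range(1, 3):  # first two pages, just as an example
    url = "http://quotes.toscrape.com/page/{}/".format(page)
    result = requests.get(url)
    soup = bs4.BeautifulSoup(result.text, "lxml")
    for tag in soup.select(".author"):
        print(tag.get_text())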

To conclude, the whole process in steps:

1. Install and import the libraries.

2. Get the URL of the website you want to scrape.

3. Parse the HTML code with BeautifulSoup, using the lxml parser.

4. Look for the class (or element tag) name as per the question.

That was easy and interesting, wasn't it?

Congratulations! You have taken your first sip of the soup, and I hope it went down easily.

But this is not the end; there is still a lot more to explore.

So, here is a task for you: Use the same site and extract the quotes from the first page.

Tell me in the comments whether you were able to complete the task or ran into any problems. I am eager to know your results.

Until next time!
