Accessing the Internet in Python Using Urllib Library

The urllib library, which ships with Python's standard library, is used to access the internet and get information from websites, including their publicly available source code. The data obtained is generally in JSON, HTML, or XML format. In this tutorial, you will see how to get data from a website using the urllib library. By the end of this tutorial, you will know:

  • How to send a request to a URL
  • How to read HTML files from a URL
  • How to get response headers from a URL

Let’s jump into it. 

How to Send a Request to a URL

You can send a request to a URL using the urlopen() function from the urllib.request module. Let us see how to run the code.

#import the request library
from urllib import request
#send a request to open the website
url = request.urlopen('https://www.h2kinfosys.com/blog/')
 
#print the result code
print('The result code is ', url.getcode())
#print the status
print('The status is ', url.status)
Output:
The result code is  200
The status is  200

Let’s unpack the code above. We begin by importing the request module from the urllib library. Afterward, we pass the URL we wish to open to the urlopen() function. Finally, we check whether the process was successful by printing the result code or status.

In both cases, the number 200 was returned. 200 is an HTTP status code that shows the request was processed successfully. Codes in the 3xx range, such as 301 (Moved Permanently), indicate a redirect rather than success or failure; urlopen() follows redirects automatically, so you will usually still end up with a 200.

Numbers such as 404 (Not Found) or 500 (Internal Server Error), on the other hand, are error codes.
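
To make this concrete, here is a minimal sketch of how error codes surface in urllib. When the server answers with a code such as 404 or 500, urlopen() raises urllib.error.HTTPError instead of returning a response object; the /this-page-does-not-exist path below is only a hypothetical example of a missing page.

#import the request and error modules
from urllib import request, error

try:
    #the path below is a hypothetical example of a page that does not exist
    response = request.urlopen('https://www.h2kinfosys.com/blog/this-page-does-not-exist')
    print('The request succeeded with status', response.status)
except error.HTTPError as err:
    #raised when the server answers with an error code such as 404 or 500
    print('The server returned an error code:', err.code)
except error.URLError as err:
    #raised when the request never reaches the server (bad domain, no network, etc.)
    print('The request failed:', err.reason)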

How to read HTML files from a URL

You can read the HTML of a page by calling the read() method on the response object returned by urlopen(). The code below reads the HTML source of the website defined.
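
A minimal sketch of what that looks like, assuming the same blog URL as in the previous example and that the page is encoded as UTF-8:

#import the request library
from urllib import request
#send a request to open the website
url = request.urlopen('https://www.h2kinfosys.com/blog/')

#read the raw bytes of the page and decode them into a string (assuming UTF-8)
html = url.read().decode('utf-8')
#print the HTML source code
print(html)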

Output:
The output is the raw HTML source of the page, so it is not reproduced here.

How to get response headers from a URL

You can get the response headers using the getheaders() method. If you don’t know what headers are, response headers are simply metadata the server sends along with the page, such as its content type and caching rules. The code below gets the headers of the URL passed in.

#import the request library
from urllib import request
#send a request to open the website
url = request.urlopen('https://www.h2kinfosys.com/blog/')
 
#print the header
print(url.getheaders())
Output:
[('Date', 'Sun, 07 Feb 2021 14:32:30 GMT'), ('Server', 'Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips PHP/7.4.5'), ('X-Powered-By', 'PHP/7.4.5'), ('Link', '<https://www.h2kinfosys.com/blog/wp-json/>; rel="https://api.w.org/"'), ('Link', '<https://www.h2kinfosys.com/blog/>; rel=shortlink'), ('Vary', 'Accept-Encoding'), ('Cache-Control', 'max-age=172800'), ('Expires', 'Tue, 09 Feb 2021 14:32:30 GMT'), ('Strict-Transport-Security', 'max-age=31536000'), ('Connection', 'close'), ('Transfer-Encoding', 'chunked'), ('Content-Type', 'text/html; charset=UTF-8')]
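
If you only need a single header rather than the full list, the same response object also provides a getheader() method; here is a small sketch using the Content-Type header as an example.

#import the request library
from urllib import request
#send a request to open the website
url = request.urlopen('https://www.h2kinfosys.com/blog/')

#fetch one header by name; the second argument is a default used if the header is missing
content_type = url.getheader('Content-Type', 'unknown')
print('The content type is', content_type)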

Note that there is a cleaner way of scraping data from a website: using Beautiful Soup together with the requests library (see the sketch below). You may still decide to use the urllib library to avoid external dependencies, since it ships with Python.
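
As a rough illustration only, the sketch below shows what the same request might look like with those third-party packages (installed separately, for example with pip install requests beautifulsoup4).

#import the third-party libraries
import requests
from bs4 import BeautifulSoup

#send a GET request to the website
response = requests.get('https://www.h2kinfosys.com/blog/')
print('The status code is', response.status_code)

#parse the HTML and print the page title
soup = BeautifulSoup(response.text, 'html.parser')
print('The page title is', soup.title.string)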

One more thing. It is crucial to point out that many popular websites such as Google, Twitter, Facebook, Amazon, and Wikipedia do not support manually scraping data from their pages. They would rather have you use their APIs to access data, as that is cleaner and reduces the traffic hitting their servers. Scraping data over a period of time may trigger their systems and get your IP blocked, especially if you send too many requests in a short time.

If you have any questions, feel free to leave them in the comment section, and I’ll do my best to answer them.
