{"id":8328,"date":"2021-02-10T16:34:34","date_gmt":"2021-02-10T11:04:34","guid":{"rendered":"https:\/\/www.h2kinfosys.com\/blog\/?p=8328"},"modified":"2021-02-10T16:34:35","modified_gmt":"2021-02-10T11:04:35","slug":"accessing-the-internet-in-python-using-urllib-library","status":"publish","type":"post","link":"https:\/\/www.h2kinfosys.com\/blog\/accessing-the-internet-in-python-using-urllib-library\/","title":{"rendered":"Accessing the Internet in Python Using Urllib Library"},"content":{"rendered":"\n<p>The urllib library is part of Python\u2019s standard library and is used to access the internet and retrieve information from websites, including their publicly available source code. The data you get back is generally in <a href=\"https:\/\/www.h2kinfosys.com\/blog\/working-with-json-in-python\/\" class=\"rank-math-link\">JSON format<\/a>, HTML, or XML. In this tutorial, you will see how to get data from a website using the urllib library. By the end of this tutorial, you will know:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>How to send a request to a URL<\/li><li>How to read HTML files from a URL<\/li><li>How to get response headers from a URL<\/li><\/ul>\n\n\n\n<p>Let\u2019s jump into it.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How to Send a Request to a URL<\/h2>\n\n\n\n<p>You can send a request to a URL using the urlopen() function from the urllib.request module. The code below shows how:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">#import the request module\nfrom urllib import request\n#send a request to open the website\nurl = request.urlopen('https:\/\/www.h2kinfosys.com\/blog\/')\n\n#print the result code\nprint('The result code is ', url.getcode())\n#print the status\nprint('The status is ', url.status)\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">Output:\nThe result code is\u00a0 200\nThe status is\u00a0 200\n<\/pre>\n\n\n\n<p>Let\u2019s unpack the code above. We begin by importing the request module from the urllib library. 
Afterward, pass the URL you wish to open to the urlopen() function. Finally, check whether the request succeeded by printing the result code or status.<\/p>\n\n\n\n<p>In both cases, the number 200 was returned. 200 is an <a href=\"https:\/\/en.wikipedia.org\/wiki\/List_of_HTTP_status_codes\" class=\"rank-math-link\" rel=\"nofollow noopener\" target=\"_blank\">HTTP status code<\/a> that shows the request was processed successfully. A code such as 301 indicates a redirect, which urlopen() follows automatically.<\/p>\n\n\n\n<p>Numbers such as 404 or 500, however, are error codes; for these, urlopen() raises a urllib.error.HTTPError instead of returning a response.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How to read HTML files from a URL<\/h2>\n\n\n\n<p>You can read the HTML of a page using the read() method of the response object that urlopen() returns. For example, calling url.read() on the response from the previous section returns the page\u2019s HTML as bytes, which you can turn into a string with decode(). Printing it gives:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Output:\nThe output is the HTML source of the page.\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">How to get response headers from a URL<\/h2>\n\n\n\n<p>You can get the response headers using the getheaders() method. If you don&#8217;t know what headers are, they are simply metadata about the response, such as its content type, the server software, and caching rules. 
The code below gets the headers of the response for the URL passed in.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">#import the request module\nfrom urllib import request\n#send a request to open the website\nurl = request.urlopen('https:\/\/www.h2kinfosys.com\/blog\/')\n\n#print the headers as a list of (name, value) tuples\nprint(url.getheaders())<\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>Output:\n&#91;('Date', 'Sun, 07 Feb 2021 14:32:30 GMT'), ('Server', 'Apache\/2.4.6 (CentOS) OpenSSL\/1.0.2k-fips PHP\/7.4.5'), ('X-Powered-By', 'PHP\/7.4.5'), ('Link', '&lt;https:\/\/www.h2kinfosys.com\/blog\/wp-json\/>; rel=\"https:\/\/api.w.org\/\"'), ('Link', '&lt;https:\/\/www.h2kinfosys.com\/blog\/>; rel=shortlink'), ('Vary', 'Accept-Encoding'), ('Cache-Control', 'max-age=172800'), ('Expires', 'Tue, 09 Feb 2021 14:32:30 GMT'), ('Strict-Transport-Security', 'max-age=31536000'), ('Connection', 'close'), ('Transfer-Encoding', 'chunked'), ('Content-Type', 'text\/html; charset=UTF-8')]\n<\/code><\/pre>\n\n\n\n<p>Note that there is a cleaner way of scraping data from a website: using Beautiful Soup and the requests library. You may still prefer the urllib library because it avoids external dependencies.<\/p>\n\n\n\n<p>One more thing. Many popular websites such as Google, Twitter, Facebook, Amazon, and Wikipedia discourage scraping data directly from their pages. They would rather have you use their API to access data, as it is cleaner and reduces the traffic on their web pages. 
Manually <a href=\"https:\/\/en.wikipedia.org\/wiki\/Web_scraping\" class=\"rank-math-link\" rel=\"nofollow noopener\" target=\"_blank\">scraping data<\/a> over a period of time may trigger their anti-abuse systems and get your IP address blocked, especially if you send too many requests in a short time.<\/p>\n\n\n\n<p>If you have any questions, feel free to leave them in the comment section, and I\u2019ll do my best to answer them.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The urllib library is a library used to access the internet and get information from websites, including their publicly available source code. You can access and get data from a website using the urllib library. Data obtained are generally in JSON format, HTML, or XML. In the tutorial, you will see how to get data [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":8331,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[342],"tags":[],"class_list":["post-8328","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-python-tutorials"],"_links":{"self":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/8328","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/comments?post=8328"}],"version-history":[{"count":0,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/8328\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media\/8331"}],"wp:attachment":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media?parent=8328"}],"
wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/categories?post=8328"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/tags?post=8328"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}