{"id":6186,"date":"2020-10-28T17:12:33","date_gmt":"2020-10-28T11:42:33","guid":{"rendered":"https:\/\/www.h2kinfosys.com\/blog\/?p=6186"},"modified":"2022-09-11T13:39:40","modified_gmt":"2022-09-11T08:09:40","slug":"reading-csv-files-with-pandas","status":"publish","type":"post","link":"https:\/\/www.h2kinfosys.com\/blog\/reading-csv-files-with-pandas\/","title":{"rendered":"Reading CSV files with Pandas"},"content":{"rendered":"\n<p>It is common knowledge that data is an indispensable ingredient when building any machine learning project. Do you want to build an AI-powered system? You need data to train your model. You don\u2019t have data? You are out of business.&nbsp;<\/p>\n\n\n\n<p>Most times, the data you\u2019d be working with are in files that can be stored and transferred easily. The comma-separated values file (popularly called CSV file) is a widespread file format for working with data files. You must be able to read, load, explore, and write data to a CSV file with Python. <a href=\"https:\/\/www.h2kinfosys.com\/blog\/getting-started-with-pandas\/\">Pandas<\/a> provide you with an easy-to-use method to carry out these processes with python. And that\u2019s what we will be discussing in the tutorial.&nbsp;<\/p>\n\n\n\n<p>By the end of the tutorial, you will learn<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>How to load a file with pandas<\/li><li>File types and extension<\/li><li>What a CSV file is<\/li><li>What a TSV file is<\/li><li>How to specify delimiters<\/li><li>File Paths and Folder Paths<\/li><li>What Current Working Directory is<\/li><li>What Relative Paths and Absolute Paths are<\/li><li>The Common errors with the read_csv() method<\/li><\/ul>\n\n\n\n<p>Let\u2019s begin with loading a CSV file with pandas<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Loading a CSV file with Pandas<\/strong><\/h2>\n\n\n\n<p>To load a CSV file with pandas, the read_csv() file method is called. First, you\u2019d need to have pandas installed on your PC and imported to your Jupyter notebook or whatever IDLE you are using. The read_csv() method has a lot of arguments that can be tweaked based on your preference.&nbsp;According to the <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.read_csv.html\" rel=\"nofollow noopener\" target=\"_blank\">Pandas official documentation<\/a>, this is the full list of the arguments of the read_csv() method.&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><strong>Signature:<\/strong> pd<strong>.<\/strong>read_csv<strong>( <\/strong>filepath_or_buffer<strong>, <\/strong>sep<strong>=',', <\/strong>delimiter<strong>=None, <\/strong>header<strong>='infer', <\/strong>names<strong>=None, <\/strong>index_col<strong>=None, <\/strong>usecols<strong>=None, <\/strong>squeeze<strong>=False, <\/strong>prefix<strong>=None, <\/strong>mangle_dupe_cols<strong>=True, <\/strong>dtype<strong>=None, <\/strong>engine<strong>=None, <\/strong>converters<strong>=None, <\/strong>true_values<strong>=None, <\/strong>false_values<strong>=None, <\/strong>skipinitialspace<strong>=False, <\/strong>skiprows<strong>=None, <\/strong>skipfooter<strong>=0, <\/strong>nrows<strong>=None, <\/strong>na_values<strong>=None, <\/strong>keep_default_na<strong>=True, <\/strong>na_filter<strong>=True, <\/strong>verbose<strong>=False, <\/strong>skip_blank_lines<strong>=True, <\/strong>parse_dates<strong>=False, <\/strong>infer_datetime_format<strong>=False, <\/strong>keep_date_col<strong>=False, <\/strong>date_parser<strong>=None, <\/strong>dayfirst<strong>=False, <\/strong>iterator<strong>=False, <\/strong>chunksize<strong>=None, <\/strong>compression<strong>='infer', <\/strong>thousands<strong>=None, <\/strong>decimal<strong>=b'.', <\/strong>lineterminator<strong>=None, <\/strong>quotechar<strong>='\"', <\/strong>quoting<strong>=0, <\/strong>doublequote<strong>=True, <\/strong>escapechar<strong>=None, <\/strong>comment<strong>=None, <\/strong>encoding<strong>=None, <\/strong>dialect<strong>=None, <\/strong>tupleize_cols<strong>=None, <\/strong>error_bad_lines<strong>=True, <\/strong>warn_bad_lines<strong>=True, <\/strong>delim_whitespace<strong>=False, <\/strong>low_memory<strong>=True, <\/strong>memory_map<strong>=False, <\/strong>float_precision<strong>=None, )<\/strong>&nbsp;\n<strong>Docstring:<\/strong> Read a comma-separated values (CSV) file into DataFrame. Also supports optionally iterating or breaking of the file into chunks.<\/pre>\n\n\n\n<p>Yes, I know it looks really intimidating. But not to worry, but most times, you won\u2019t be defining the long list of parameters here. Of all these parameters, there is only one required parameter -the file path, which is pretty all you need to get started.&nbsp;<\/p>\n\n\n\n<p>The framework for loading a CSV file, saved in the same file path as the python compiler, is shown below.&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the pandas library<\/em>\n<strong>import<\/strong> <strong>pandas<\/strong> <strong>as<\/strong> <strong>pd<\/strong>\n\n<em>#read your data using the read_csv method of pandas<\/em>\npd.read_csv('name of your file.csv')\n<\/pre>\n\n\n\n<p>This is for a singular scenario where the file is stored in the same path as your python compiler, the file is not encoded, the content of the file is separated by only commas, the file has no title or description, etc.<\/p>\n\n\n\n<p>There are times when this would not be the case. In order to master the process of reading CSV files with pandas, there are some critical concepts you must understand. First, you must understand what file extensions are and the differences between the various file extensions available. Second, you must understand what a working directory is and what a file path is. Third, you must have a solid understanding of what is in your CSV file itself. Sometimes, they are not just information separated with commas. And as a bonus, you should be able to read and decode the meaning of error messages when they appear.&nbsp;<\/p>\n\n\n\n<p>We will help you have a solid understanding of all the four points listed in this tutorial. Let\u2019s begin with understanding the concept of file types and extension.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Understanding File Types and Extension<\/strong><\/h2>\n\n\n\n<p>When you store a file on your container, the file has a name called filename and an extension. The file extension is simply the code after the dot sign in the file name.&nbsp;<\/p>\n\n\n\n<p>Since various files have various contents, the computer must know how to read these files to access its contents. That\u2019s the essence of a file extension. The content of the image, for instance, is quite different from the content music and a Word document. It is the file extension that informs the computer to read an image like an image, read music, and read a word document as a word document. So, for a file with .png or .jpeg, the computer understands that the content of the file is an image and will parse or read the file accordingly.&nbsp;<\/p>\n\n\n\n<p>In most computers, the file extension is hidden from the user. You can, however, determine the file extension by checking the file properties and check the type of the file. Just right click on the file and click on Properties. Once you do, a dialog box as seen below will appear on your screen.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/H7S3W2VaXlzld5-JyCBjcNzmAGuTOG4ceOx5KLzSFYo9XrhVvaMbYIc9aQkfW_0bZ1REeYXSGkuSbD4nvSkm4gQM7ngGUE7Qgv0yBdgcSl5VxligC3jDUxX3nUBwsZGy_46kMpQIYdS-1YLkFw\" alt=\"Reading CSV files \" title=\"\"><\/figure>\n\n\n\n<p>You can see that this is a CSV file since it has a .csv extension. Alternatively, you can change the settings of your computer to see file extensions alongside the filename. Just click on the View tab of the file explorer and check the \u2018File extension name\u2019. Once checked, the files now appear with both filenames and their respective extension.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/SKTzOJAKNYjMVSwSlEEigMXR9QfD7ooPNLqGxTDxDdUr90fVWiD9AN7sMRD3uNZSU0yJcmAMvYjnQW8MwqpW_nxGuk_qvAbwFyhn7PD75LQHHveEFQ3G8zQk49HHxF-cvYvlNZZO6ZsIs6AqHw\" alt=\"Reading CSV files \" title=\"\"><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What is a CSV file?<\/strong><\/h2>\n\n\n\n<p>Now that we have discussed what a file extension is and how they are useful for the computer to read a file, let\u2019s focus on CSV files (the crux of our discussion). CSV means <a href=\"https:\/\/en.wikipedia.org\/wiki\/Comma-separated_values#:~:text=A%20comma%2Dseparated%20values%20(CSV,name%20for%20this%20file%20format.\" rel=\"nofollow noopener\" target=\"_blank\">comma-separated values<\/a>. It implies that the values in a CSV file are separated by a comma. CSV files are typically used to store data. The columns are separated by a comma while the rows are separated by a new line. A CSV file is simple and flexible which explains why it is used by many data scientists and researchers for storing and retrieving data.&nbsp;<\/p>\n\n\n\n<p>If we attempt to open a CSV with advanced Sheet readers such as Excel, it displays the data in tabular form. If you however try to use Notepad, you will notice how the values in the file are separated with commas. The first line is typically considered as the column name.&nbsp;<\/p>\n\n\n\n<p>Let\u2019s open the iris dataset with both Excel and Notepad so you see for yourself. The iris data is a popular dataset in machine learning that tries to classify flower species based on its sepal length, petal length, sepal width, and petal width. The dataset looks like this when you open it with Excel.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/S0c4wyay3jJFoJXOBUbLxiPamr2vXfhcNE-oomAupuPgsLLWlKYcz2zcgI5RsylRmm_kE1fdJMv1_OBd51zA6oMgtC5zWU8ERg_OPWIVy25vJ3Vy-ohRtirgpeQSerL-oKMSDprYAacCnOq9jw\" alt=\"Reading CSV files \" title=\"\"><\/figure>\n\n\n\n<p>If this same file is opened with Notepad, you can begin to appreciate why it is called a comma-separated value (CSV) file. Have a look.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/R8I9q0mdtKsGidGwO4Q-Ej3hOmnajMyay4xQ3iHTjnlF34cBUzfCxfSa-Os3YykgPe5p1j_FClXAbIhwC_0qUcu_lbjdnwGRMyHYQZjFicGdrx0IAykU0JIlVjYL1HtENzVLMTsX4LHML5PFiw\" alt=\"Reading CSV files \" title=\"\"><\/figure>\n\n\n\n<p>As seen, the first line is considered the column names while subsequent lines are considered the values for each column. There may be slight changes in the way a CSV file is formatted which is why it is strongly advised to inspect your CSV file before importing it to your notebook. Let\u2019s see some of the tweaks that can be done to a CSV file.&nbsp;<\/p>\n\n\n\n<p><strong>TSV files &#8211; It\u2019s Not Always Comma Separated&nbsp;<\/strong><\/p>\n\n\n\n<p>From our discussion so far, it is apparent that commas are a vital part of CSV files. But it\u2019s not always commas. In some situations, the values are separated by tabs (\\t). These are called tab-separated values or TSV files for short. If you want to load a file that is rather separated by tabs, you need to specify it with the sep parameter when calling the read_csv() method. By default, it is set to \u2018,\u2019. It must be changed to \u2018\\t\u2019 else it throws an error.&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#read a CSV file rather separated by tab<\/em>\npd.read_csv('name of your file.csv', sep='<strong>\\t<\/strong>')\n<\/pre>\n\n\n\n<p><strong>Specifying Delimiters<\/strong><\/p>\n\n\n\n<p>We have established the fact that the presence of commas in a CSV file separates the column. But what if the text field itself has a comma? For instance, an address column will most like have commas separating the house number, street, city, and state. To ensure pandas do not see the house number, street, city, and state as separate columns, we must encapsulate the address field in \u2018quote character\u2019.&nbsp;<\/p>\n\n\n\n<p>When loading the file with the pandas read_csv() method, the quote character is then specified with the quotechar argument of the method. By default, the quotechar argument is set to double quotation marks (\u201c). This means that any comma that appears in a character encapsulated in double quotation marks would not be separated into another column.&nbsp;<\/p>\n\n\n\n<p>Let\u2019s take an example. In this dataset, the \u2018Purchase Address\u2019 column has some commas but we definitely do not want them separated. Therefore, the entries would be encapsulated in quotation marks. The dataset opened with Notepad is shown below.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/aPYrO-aYM2OawT9RBF-u8U2BaEiDIRiR-UxJ-y2y2YuiT9pBl6X1cgPFZkBsE8TiNSmUdcUuiYMtnYnDnksAWPRMxlZaLefcpYLbxuY09f8gKj2wruioUxsIQdl20GR53c3d4mrko6pEfp6X0w\" alt=\"Reading CSV files \" title=\"\"><\/figure>\n\n\n\n<p>The dataset is read with the code below.&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the pandas library<\/em>\n<strong>import<\/strong> <strong>pandas<\/strong> <strong>as<\/strong> <strong>pd<\/strong>\n\n<em>#read a data with the Purchase Address column having commas<\/em>\npd.read_csv('sample data.csv', quotechar='\"').head()\n<\/pre>\n\n\n\n<p>Output:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/muoTNUOZ7JXacL6vl6pUNaKfV_b4jd6yfSCBk4C2ztqQYmCEpcfTKagsiSY8ZG8zo7z2IRazHl5tVXT-RQVQ5YPjfGehXXG100CkzzlazyZjJeO5AY3hapyRcKGMWJnr9-GDAtx8WM4bYRQQ7A\" alt=\"Reading CSV files \" title=\"\"><\/figure>\n\n\n\n<p>The process is the same for TSV files when tabs exist in a particular column.&nbsp;<\/p>\n\n\n\n<p><strong>Understanding File and Folder Paths<\/strong><\/p>\n\n\n\n<p>When you attempt to load any file by specifying its filename, Python checks your current work directory for the file. If the filename is not found in the current work directory, your Jupyter notebook throws a FIleNotFoundError. So, the question is\u2026<\/p>\n\n\n\n<p><strong>How do you find Your Current Working Directory in Python?<\/strong><\/p>\n\n\n\n<p>The os.getcwd() method is used to print your current working directory. If you want to display the files in your current working directory, the os.listdir() method does this. Note that to use these methods, you need to import the OS library. The code below prints the current working directory on your PC.&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><em>#import the OS library<\/em>\n<strong>import<\/strong> <strong>os<\/strong>\n&nbsp;\n<em>#print the current working directory&nbsp;<\/em>\nos.getcwd()<\/pre>\n\n\n\n<p>Output:<\/p>\n\n\n\n<p>&#8216;C:<strong>\\\\<\/strong>Users<strong>\\\\<\/strong>wale obembe&#8217;<\/p>\n\n\n\n<p>The output will be different for you.&nbsp;<\/p>\n\n\n\n<p><strong>Understanding Absolute and Relative Path<\/strong><\/p>\n\n\n\n<p>When using the read_csv() method, the file name can be specified using either the relative path or the absolute path. Let\u2019s understand what these are.<\/p>\n\n\n\n<p><strong>What is a Relative Path?<\/strong><\/p>\n\n\n\n<p>A relative path is a path to the file when you begin the file location from the current working directory. If in my current working directory, I want to access the file named \u2018Data_1.csv\u2019 in a folder named \u2018Dataset\u2019 I only need to specify the relative path of the file, which is \u201cDataset\/ Data_1.csv\u201d. Note that relative paths are only used for files within your current working directory.<\/p>\n\n\n\n<p><strong>What is an Absolute Path?<\/strong><\/p>\n\n\n\n<p>An absolute path is a path to a file when specifying the file location from the base of your computer file system. If you want to specify a file outside your current working directory, you must use the absolute path. Say I want to load a file called SampleData in the \u2018Document\u2019 folder, I must specify the absolute path which is &#8220;C:\/Users\/wale obembe\/Documents\/SampleData.doc&#8221;.&nbsp;<\/p>\n\n\n\n<p>You can easily copy your file path by clicking on \u2018Copy Path\u2019 from your clipboard<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/tBVEuwwG5P5qu5l6WIn19LSHiS-ILtnUex3l19kixIZi0FyjTVZADMG7o2UiS40ol9IXAOK__wazJcKV0DUfUs0O4BHAInek8Pgdu9d-IMSCmY-jAhDX6qEX0xxeuOQvKiJGMz_t2SeeTrH-pQ\" alt=\"Reading CSV files \" title=\"\"><\/figure>\n\n\n\n<p>Note: It is advised to use relative paths rather than absolute paths. With absolute paths, your code will run even when you transfer it to another computer. This is not the case for absolute paths.&nbsp;<\/p>\n\n\n\n<p>Let\u2019s round off this tutorial by treating some of the common errors you\u2019d be faced with when reading data with the read_csv() method.&nbsp;<\/p>\n\n\n\n<p><strong>Common Errors with Pandas read_csv() Method<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>SyntaxError: (Unicode error) &#8216;Unicode escape&#8217; codec can&#8217;t decode bytes in position 2-3: truncated \\UXXXXXXXX escape. You will get this error when you use backward slash (\\) for your file path rather than forward-slash (\/). This is perhaps the most common errors, neophytes face. When you copy the file path from your clipboard, it uses the backward slash which is seen as a normal string. To fix this error, you need to convert the normal string to a raw string by adding an \u2018r\u2019 just before specifying the path. An example: r &#8220;C:\\Users\\wale obembe\\Documents\\SampleData.csv&#8221; rather than just &#8220;C:\\Users\\wale obembe\\Documents\\SampleData.csv&#8221;.&nbsp;<\/li><li>FileNotFoundError: This is another common error and it is self-explanatory \u2013 the file is not found in the specified directory. It may be that you specified the filename without its extension. You will need to add the extension. For instance, it is SampleData.csv and not just SampleData.&nbsp;<\/li><li>UnicodeDecodeError:&nbsp;&#8216;utf-8&#8217; codec can&#8217;t decode byte in position: invalid continuation byte. Every file has some form of encoding, the common being \u2018UTF-8\u2019. Sometimes, your file may contain non-standard characters and be saved with a different encoding. You will need to check the encoding of your document and specify it using the encoding parameter of the read_csv() method. You can find your file when you open the file with Notebook. It is always in the bottom left corner<\/li><\/ol>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/3umBrkH3t6CGSSHHUSYJvkoEcTquTi6eLpW5V84K5Ye67RCvduBH3UzWqO_hBDhPS7dQbhTeQglg3wwWftbIdzeregSKpM0AnN9ncW1WbdTyM69B5R_1LumCDij2FnRCr25hs6Gz4cizaA9Gzw\" alt=\"\" title=\"\"><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>It is common knowledge that data is an indispensable ingredient when building any machine learning project. Do you want to build an AI-powered system? You need data to train your model. You don\u2019t have data? You are out of business.&nbsp; Most times, the data you\u2019d be working with are in files that can be stored [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":6211,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":"","_members_access_role":[],"_members_access_error":""},"categories":[498],"tags":[],"class_list":["post-6186","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artificial-intelligence-tutorials"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/6186","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/comments?post=6186"}],"version-history":[{"count":0,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/6186\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media\/6211"}],"wp:attachment":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media?parent=6186"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/categories?post=6186"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/tags?post=6186"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}