It is common knowledge that data is an indispensable ingredient when building any machine learning project. Do you want to build an AI-powered system? You need data to train your model. You don’t have data? You are out of business.
Most of the time, the data you work with lives in files that can be stored and transferred easily. The comma-separated values file (popularly called a CSV file) is a widespread format for such data. You must be able to read, load, explore, and write data to a CSV file with Python. Pandas provides easy-to-use methods for carrying out these tasks in Python, and that’s what we will be discussing in this tutorial.
By the end of the tutorial, you will learn:
- How to load a file with pandas
- File types and extensions
- What a CSV file is
- What a TSV file is
- How to specify delimiters
- File Paths and Folder Paths
- What Current Working Directory is
- What Relative Paths and Absolute Paths are
- The Common errors with the read_csv() method
Let’s begin with loading a CSV file with pandas.
Loading a CSV file with Pandas
To load a CSV file with pandas, you call the read_csv() method. First, you need to have pandas installed on your PC and imported into your Jupyter notebook or whatever IDE you are using. The read_csv() method has a lot of arguments that can be tweaked based on your preference. According to the official pandas documentation at the time of writing, this is the full list of the arguments of the read_csv() method. The exact list varies between pandas versions.
Signature:
pd.read_csv(
    filepath_or_buffer,
    sep=',',
    delimiter=None,
    header='infer',
    names=None,
    index_col=None,
    usecols=None,
    squeeze=False,
    prefix=None,
    mangle_dupe_cols=True,
    dtype=None,
    engine=None,
    converters=None,
    true_values=None,
    false_values=None,
    skipinitialspace=False,
    skiprows=None,
    skipfooter=0,
    nrows=None,
    na_values=None,
    keep_default_na=True,
    na_filter=True,
    verbose=False,
    skip_blank_lines=True,
    parse_dates=False,
    infer_datetime_format=False,
    keep_date_col=False,
    date_parser=None,
    dayfirst=False,
    iterator=False,
    chunksize=None,
    compression='infer',
    thousands=None,
    decimal=b'.',
    lineterminator=None,
    quotechar='"',
    quoting=0,
    doublequote=True,
    escapechar=None,
    comment=None,
    encoding=None,
    dialect=None,
    tupleize_cols=None,
    error_bad_lines=True,
    warn_bad_lines=True,
    delim_whitespace=False,
    low_memory=True,
    memory_map=False,
    float_precision=None,
)
Docstring:
Read a comma-separated values (CSV) file into DataFrame. Also supports optionally iterating or breaking of the file into chunks.
Yes, I know it looks really intimidating. But not to worry: most of the time, you won’t be defining this long list of parameters. Of all these parameters, only one is required, the file path, which is pretty much all you need to get started.
The basic pattern for loading a CSV file saved in your current working directory is shown below.
#import the pandas library
import pandas as pd

#read your data using the read_csv method of pandas
pd.read_csv('name of your file.csv')
This works for the simplest scenario: the file is stored in your current working directory, it uses the default encoding (UTF-8), its values are separated only by commas, it has no title or description lines above the column names, and so on.
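To see this simplest case run from start to finish, here is a minimal sketch. The file name people.csv and its two rows are made up for illustration, and the file is written to a temporary folder only so that the example does not depend on anything already on your PC.

```python
import os
import tempfile

import pandas as pd

# Made-up CSV content: a header line followed by two rows.
csv_text = "name,age\nAda,36\nGrace,45\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, created only for this example.
    csv_path = os.path.join(tmp_dir, "people.csv")
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write(csv_text)

    # Only the file path is passed; every other argument keeps its default.
    df = pd.read_csv(csv_path)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names taken from the first line
```

Because no other argument is given, pandas assumes commas as separators and treats the first line as the column names.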
There are times when this would not be the case. In order to master the process of reading CSV files with pandas, there are some critical concepts you must understand. First, you must understand what file extensions are and the differences between the various file extensions available. Second, you must understand what a working directory is and what a file path is. Third, you must have a solid understanding of what is in your CSV file itself. Sometimes, it is not just information separated with commas. And as a bonus, you should be able to read and decode the meaning of error messages when they appear.
This tutorial will help you build a solid understanding of all four points. Let’s begin with the concept of file types and extensions.
Understanding File Types and Extensions
When you store a file on your computer, the file has a name (the filename) and an extension. The file extension is simply the set of characters after the last dot in the file name.
Since different files have different contents, the computer must know how to read each file to access its contents. That’s the essence of a file extension. The content of an image, for instance, is quite different from the content of a music file or a Word document. It is the file extension that tells the computer to read an image as an image, music as music, and a Word document as a Word document. So, for a file with a .png or .jpeg extension, the computer understands that the content of the file is an image and will parse or read the file accordingly.
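If you would rather check an extension from Python, the standard library’s os.path.splitext() splits a filename at its last dot. The sketch below uses hypothetical filenames purely for illustration.

```python
import os

# Hypothetical filenames, used only to illustrate the split.
filenames = ["iris.csv", "photo.png", "report.final.docx", "README"]

# os.path.splitext() returns a (name, extension) pair, splitting at the last dot.
extensions = {filename: os.path.splitext(filename)[1] for filename in filenames}

for filename, extension in extensions.items():
    print(f"{filename!r} has extension {extension!r}")
```

A file without a dot, such as README here, simply gets an empty extension.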
On most computers, the file extension is hidden from the user. You can, however, determine the file extension by checking the type of the file in its properties. Just right-click on the file and click on Properties. Once you do, a dialog box as seen below will appear on your screen.
You can see that this is a CSV file since it has a .csv extension. Alternatively, you can change the settings of your computer to see file extensions alongside the filename. Just click on the View tab of the File Explorer and check ‘File name extensions’. Once checked, the files appear with both their filenames and their respective extensions.
What is a CSV file?
Now that we have discussed what a file extension is and how it helps the computer read a file, let’s focus on CSV files (the crux of our discussion). CSV means comma-separated values. It implies that the values in a CSV file are separated by commas. CSV files are typically used to store data. The columns are separated by commas while the rows are separated by new lines. A CSV file is simple and flexible, which explains why it is used by many data scientists and researchers for storing and retrieving data.
If we open a CSV file with a spreadsheet application such as Excel, it displays the data in tabular form. If you open it with Notepad instead, you will notice how the values in the file are separated by commas. The first line is typically treated as the column names.
Let’s open the iris dataset with both Excel and Notepad so you can see for yourself. The iris data is a popular machine learning dataset used to classify flower species based on sepal length, petal length, sepal width, and petal width. The dataset looks like this when you open it with Excel.
If this same file is opened with Notepad, you can begin to appreciate why it is called a comma-separated value (CSV) file. Have a look.
As seen, the first line is considered the column names while subsequent lines are considered the values for each column. CSV files can differ slightly in how they are formatted, which is why it is strongly advised to inspect your CSV file before importing it into your notebook. Let’s see some of the tweaks that can be done to a CSV file.
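One quick way to inspect a file before loading it is to print its first few raw lines from Python. The sketch below is only an illustration: the text is a hand-typed sample in the style of the iris data rather than the real file, and io.StringIO stands in for open('your file.csv').

```python
import io

# Hand-typed sample in the style of the iris data (not the real file).
raw_text = (
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
)

# io.StringIO behaves like an opened text file, so this mirrors open(path).
with io.StringIO(raw_text) as f:
    first_lines = [f.readline().rstrip("\n") for _ in range(3)]

# The first line holds the column names; the rest hold one row of values each.
header = first_lines[0].split(",")
rows = [line.split(",") for line in first_lines[1:]]

print("Column names:", header)
for row in rows:
    print("Row values:  ", row)
```

Looking at the raw lines this way tells you which separator is used and whether the first line really is a header.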
TSV files – It’s Not Always Comma Separated
From our discussion so far, it is apparent that commas are a vital part of CSV files. But it’s not always commas. In some situations, the values are separated by tabs (\t). These are called tab-separated values, or TSV files for short. If you want to load a file whose values are separated by tabs, you need to specify this with the sep parameter when calling the read_csv() method. By default, it is set to ‘,’. It must be changed to ‘\t’; otherwise pandas will typically read each row into a single column or fail with a parsing error.
#read a file whose values are separated by tabs rather than commas
pd.read_csv('name of your file.csv', sep='\t')
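To see the effect of the sep parameter, the sketch below reads the same tab-separated text twice, once with the default separator and once with sep='\t'. The city names and numbers are made up, and io.StringIO stands in for a real file.

```python
import io

import pandas as pd

# Made-up tab-separated content: a header line and two rows.
tsv_text = "city\tpopulation\nLagos\t100\nAccra\t200\n"

# Default separator (comma): no commas are found, so each line becomes one column.
wrong = pd.read_csv(io.StringIO(tsv_text))

# Tab separator: the two columns are split as intended.
right = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

Comparing the two shapes is a quick sanity check that the separator you chose matches the file.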
We have established that commas in a CSV file separate the columns. But what if the text field itself has a comma? For instance, an address column will most likely have commas separating the house number, street, city, and state. To ensure pandas does not see the house number, street, city, and state as separate columns, we must enclose the address field in a ‘quote character’.
When loading the file with the pandas read_csv() method, the quote character is then specified with the quotechar argument of the method. By default, the quotechar argument is set to the double quotation mark ("). This means that any comma that appears in a field enclosed in double quotation marks will not be treated as a column separator.
Let’s take an example. In this dataset, the ‘Purchase Address’ column has some commas but we definitely do not want them separated. Therefore, the entries are enclosed in quotation marks. The dataset opened with Notepad is shown below.
The dataset is read with the code below.
#import the pandas library
import pandas as pd

#read data whose Purchase Address column contains commas
pd.read_csv('sample data.csv', quotechar='"').head()
The process is the same for TSV files when tabs exist in a particular column.
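Here is a small self-contained sketch of the quote character at work. The products and addresses are invented for illustration and io.StringIO stands in for the file on disk. The second read assumes a file that uses single quotes instead, to show where quotechar needs changing.

```python
import io

import pandas as pd

# Invented data: the address field contains commas, so it is wrapped in double quotes.
csv_text = (
    "Product,Purchase Address\n"
    'Cable,"12 Example St, Springfield, XX 00000"\n'
    'Charger,"34 Sample Ave, Shelbyville, XX 00000"\n'
)

# quotechar='"' is the default, written out here only for clarity.
df = pd.read_csv(io.StringIO(csv_text), quotechar='"')
print(df.shape)
print(df["Purchase Address"][0])

# Same idea with single quotes as the quote character (an assumed variant of the file).
csv_single = "Product,Purchase Address\nCable,'12 Example St, Springfield'\n"
df_single = pd.read_csv(io.StringIO(csv_single), quotechar="'")
print(df_single["Purchase Address"][0])
```

In both cases the commas inside the quoted address stay within one column.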
Understanding File and Folder Paths
When you attempt to load a file by specifying only its filename, Python checks your current working directory for the file. If the file is not found in the current working directory, Python raises a FileNotFoundError. So, the question is…
How do you find Your Current Working Directory in Python?
The os.getcwd() method returns your current working directory. If you want to list the files in your current working directory, the os.listdir() method does this. Note that to use these methods, you need to import the os module. The code below prints the current working directory on your PC.
#import the os module
import os

#print the current working directory
print(os.getcwd())
The output will be different for you.
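Since os.listdir() was mentioned as well, here is a short sketch that combines both calls to show which CSV files sit in your current working directory. The filtering by the .csv ending is just one simple way to do it, and the output depends entirely on your own folder.

```python
import os

# Folder that Python searches when you pass a bare filename to read_csv().
cwd = os.getcwd()
print("Current working directory:", cwd)

# Keep only the entries whose names end in .csv (case-insensitive).
csv_files = [name for name in os.listdir(cwd) if name.lower().endswith(".csv")]
print("CSV files found here:", csv_files)
```

If the file you want is not in this list, you will need to give read_csv() a path rather than just the filename.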
Understanding Absolute and Relative Path
When using the read_csv() method, the file name can be specified using either the relative path or the absolute path. Let’s understand what these are.
What is a Relative Path?
A relative path is the path to a file starting from the current working directory. If, from my current working directory, I want to access the file named ‘Data_1.csv’ in a folder named ‘Dataset’, I only need to specify the relative path of the file, which is “Dataset/Data_1.csv”. Note that relative paths are most convenient for files within your current working directory.
What is an Absolute Path?
An absolute path is the path to a file starting from the root of your computer’s file system. If you want to specify a file outside your current working directory, the absolute path is usually the simplest choice. Say I want to load a file called SampleData in the ‘Documents’ folder; I can specify the absolute path, which is “C:/Users/wale obembe/Documents/SampleData.csv”.
You can easily copy a file’s path by selecting the file and clicking ‘Copy Path’ in the Clipboard section of the File Explorer Home tab.
Note: It is advised to use relative paths rather than absolute paths. With relative paths, your code will still run when you transfer the project folder to another computer. This is not the case for absolute paths.
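The relationship between the two kinds of path can be checked from Python. In this sketch, the Dataset folder and Data_1.csv file are the hypothetical names from the example above and do not need to exist, because the functions used only manipulate the path text.

```python
import os

# Hypothetical relative path: interpreted from the current working directory.
relative_path = os.path.join("Dataset", "Data_1.csv")

# os.path.abspath() prepends the current working directory to build the absolute path.
absolute_path = os.path.abspath(relative_path)

print(relative_path, "is absolute?", os.path.isabs(relative_path))
print(absolute_path, "is absolute?", os.path.isabs(absolute_path))
```

The absolute version changes from one computer to the next, while the relative version stays the same as long as the folder layout is kept.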
Let’s round off this tutorial by treating some of the common errors you’d be faced with when reading data with the read_csv() method.
Common Errors with Pandas read_csv() Method
- SyntaxError: (unicode error) ‘unicodeescape’ codec can’t decode bytes in position 2-3: truncated \UXXXXXXXX escape. You will get this error when you write a Windows file path with backslashes (\) in a normal string. This is perhaps the most common error beginners face. When you copy the file path using ‘Copy Path’, it uses backslashes, and in a normal string Python treats \U (as in C:\Users) as the start of a Unicode escape sequence. To fix this error, either replace the backslashes with forward slashes (/) or convert the normal string to a raw string by adding an ‘r’ just before the opening quote. An example: r"C:\Users\wale obembe\Documents\SampleData.csv" rather than just "C:\Users\wale obembe\Documents\SampleData.csv".
- FileNotFoundError: This is another common error and it is self-explanatory: the file is not found in the specified directory. It may be that you specified the filename without its extension, in which case you will need to add it. For instance, it is SampleData.csv and not just SampleData. It may also be that the file is not in your current working directory, so check the path as well.
- UnicodeDecodeError: ‘utf-8’ codec can’t decode byte in position: invalid continuation byte. Every text file is saved with some form of encoding, the most common being ‘UTF-8’. Sometimes, your file may contain special characters and be saved with a different encoding. You will need to check the encoding of your document and specify it using the encoding parameter of the read_csv() method. You can find your file’s encoding by opening the file with Notepad. It is shown in the status bar at the bottom of the window.
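The three errors above can be reproduced and handled in a few lines. The sketch below is only an illustration under stated assumptions: the missing filename and the Latin-1 sample file are made up for the demonstration, and falling back to encoding='latin-1' is one possible fix rather than the right answer for every file.

```python
import os
import tempfile

import pandas as pd

# 1. Raw strings: the r prefix keeps backslashes literal, so \U is not an escape.
escaped = "C:\\Users\\wale obembe\\Documents\\SampleData.csv"  # doubled backslashes
raw = r"C:\Users\wale obembe\Documents\SampleData.csv"         # raw string
same_path = escaped == raw

# 2. FileNotFoundError: catch it and report the name that could not be found.
missing_message = ""
try:
    pd.read_csv("this_file_does_not_exist.csv")  # made-up name
except FileNotFoundError as err:
    missing_message = str(err)

# 3. UnicodeDecodeError: write a made-up file in Latin-1, then read it back.
used_fallback = False
with tempfile.TemporaryDirectory() as tmp_dir:
    latin_path = os.path.join(tmp_dir, "latin1_sample.csv")
    with open(latin_path, "w", encoding="latin-1") as f:
        f.write("name,city\nZoé,Orléans\n")

    try:
        df = pd.read_csv(latin_path)  # pandas assumes UTF-8 by default
    except UnicodeDecodeError:
        used_fallback = True
        df = pd.read_csv(latin_path, encoding="latin-1")

print("Both spellings give the same path:", same_path)
print("Missing file message:", missing_message)
print("Needed the encoding fallback:", used_fallback)
print(df.shape)
```

In practice, it is better to find out the file’s actual encoding and pass it directly than to guess, since a wrong guess can load without errors but show the wrong characters.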