In the previous article, we have discussed what pandas is? It’s importance in data science, how to install it, and perform basic operations like adding and deleting index, rows, and columns in a DataFrame. Now we will dive deeper into the applications of pandas in real-time situations like Data Reading and Data Inspection Using Pandas
As a data scientist or an analyst, you’ll probably come across many file types to import and use in your Python scripts. Some analysts use Microsoft Excel, but the application limits what you can do with large data imports. The better option is pandas. It is a powerful analysis toolkit that’s much more intuitive for a data scientist.
What file formats can pandas use…?
Python can handle virtually any data file format much more than Microsoft Excel. That’s the strength of Python. It’s open-source, and there’s probably a library out there to handle it, so you get a vastly more compatible system.
These are the most common types of Data which we will come across
- Comma-separated values (CSV)
- XLSX
- JSON
- XML
- HTML
- Images
- DOCX
- SQL
How to read and write tabular data ?
Now we will learn to read and write data using pandas functions. We will use pandas read_csv() and .to_csv() functions
A comma-separated values (CSV) file is a plaintext file with a .csv extension that holds tabular data. This is one of the most popular file formats for storing large amounts of data. Each row of the CSV file represents a single table row. The values in the same row are by default separated with commas, but you could change the separator to a semicolon, tab, space, or some other character.
Write a CSV File
You can save your Pandas DataFrame as a CSV file with .to_csv():
df.to_csv(‘data.csv’)
That’s it! You’ve created the file data.csv in your current working directory. You can expand the code block below to see how your CSV file should look:
data.csv
Read a CSV File
Once your data is saved in a CSV file, you’ll likely want to load and use it from time to time. You can do that with the Pandas read_csv() function:
df = pd.read_csv(‘data.csv’, index_col=0)
df
COUNTRY | POP | AREA | CONT | IND_DAY |
China | 1398.72 | 9596.96 | Asia | NaN |
India | 1351.16 | 3287.26 | Asia | 1947-08-15 |
US | 329.74 | 9833.52 | N.America | 1776-07-04 |
Indonesia | 268.07 | 1910.93 | Asia | 1945-08-17 |
Write an Excel File
You can save your Pandas DataFrame as a CSV file with .to_excel():
df.to_csv(‘data.csv’)
Read an Excel File
You can do that with the Pandas read_excel() function:
df = pd.read_excel('data.xlsx', index_col=0)
Write an Json File
You can save your Pandas DataFrame as a CSV file with .to_json():
df.to_json('data-index.json', orient='index')
Read an Json File
You can do that with the Pandas read_json() function:
df = pd.read_json('data.xlsx', index_col=0)
Write Files
Series and DataFrame objects have methods that enable writing data and labels to the clipboard or files. They’re named with the pattern .to_<file-type>(), where <file-type> is the type of the target file.
You’ve learned about .to_csv() and .to_excel(), but there are others, including:
- .to_json()
- .to_html()
- .to_sql()
- .to_pickle()
There are still more file types that you can write to, so this list is not exhaustive.
Read Files
Pandas functions for reading the contents of files are named using the pattern .read_<file-type>(), where <file-type> indicates the type of the file to read. You’ve already seen the Pandas read_csv() and read_excel() functions. Here are a few others:
- read_json()
- read_html()
- read_sql()
- read_pickle()
These functions have a parameter that specifies the target file path. It can be any valid string that represents the path, either on a local machine or in a URL. Other objects are also acceptable depending on the file type.
How to view and inspect data in a DataFrame ?
For checking the data of pandas.DataFrame and pandas.Series with many rows and columns head() and tail() methods are useful.
Now we will use Iris Data set from kaggle for this tutorial
“ https://www.kaggle.com/uciml/iris “
import pandas as pd
df = sns.load_dataset("iris")
Get first n rows of DataFrame: head()
The head() method returns the first n rows.
print(df.head(5))
sepal_length | sepal_width | petal_length | petal_width | species | |
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Get first n rows of DataFrame: tail()
The tail() method returns the first n rows.
print(df.tail(5))
sepal_length | sepal_width | petal_length | petal_width | species | |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 26.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
159 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
Pandas .shape,.size and .ndim are used to return size, shape and dimensions of data frames and series.
Create a DataFrame
import pandas as pd import numpy as np d={‘Name’:pd.Series(['Tom','James','Ricky','Vin', 'Steve']),'Age':pd.Series([25,26,25,23,30]),'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20])} #Create a DataFrame df = pd.DataFrame(d) print df
Output:
Age | Name | Rating | |
0 | 25 | Tom | 4.23 |
1 | 26 | James | 3.24 |
2 | 25 | Ricky | 3.98 |
3 | 23 | Vin | 2.56 |
4 | 30 | Steve | 3.20 |
.shape Returns a tuple representing the dimensionality of the DataFrame. Tuple (a,b), where a represents the number of rows and b represents the number of columns.
df.shape
Output:
(5, 3) //5 Rows & 3 Columns
.size Returns the number of elements in the DataFrame.
df.size
Output:
21 // The total number of elements in our object is:
.ndim Returns the number of dimensions of the object. By definition, DataFrame is a 2D object
df.ndim
Output:
2 // The dimension of the object is
Pandas .info() function is used to print a concise summary of a DataFrame. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.
Consider the following DataFrame df
int_col | text_col | float_col | |
0 | 1 | alpha | 0.00 |
1 | 2 | beta | 0.25 |
2 | 3 | gamma | 0.50 |
3 | 4 | delta | 0.75 |
4 | 5 | Epsilon | 1.00 |
df.info()
Output:
<class ‘pandas.core.frame.DataFrame’> RangeIndex: 5 entries, 0 to 4 Data columns (total 3 columns): float_col 5 non-null float64 int_col 5 non-null int64 text_col 5 non-null object dtypes: float64(1), int64(1), object(1) memory usage: 192.0+ bytes
Pandas .describe() function computes a summary of statistics pertaining to the DataFrame columns. This function gives the mean, std and IQR values. And, function excludes the character columns and given summary about numeric columns
df.describe()
Output:
float_col | int_col | |
count | 5.000000 | 5.000000 |
mean | 0.500000 | 3.000000 |
std | 0.395285 | 1.581139 |
min | 0.000000 | 1.000000 |
25% | 0.250000 | 2.000000 |
50% | 0.500000 | 3.000000 |
75% | 0.750000 | 4.000000 |
max | 1.000000 | 5.000000 |
Pandas .value_counts() function returns object containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
Consider a DataFrame
Student | |
0 | Harry |
1 | Mike |
2 | Arther |
3 | Harry |
4 | Arther |
Output:
Harry 2 Arther 2 Mike 1 Nick 1 Name: Student, dtype: int64
Pandas is a huge concept where we will learn all its components in a step by step manner. In the next article we will discuss about data selection, data cleaning, filtering, sorting, group-by, joining and combining of the dataset