Data Reading and Data Inspection Using Pandas

In the previous article, we have discussed what pandas is? It’s importance in data science, how to install it, and perform basic operations like adding and deleting index, rows, and columns in a DataFrame. Now we will dive deeper into the applications of pandas in real-time situations like Data Reading and Data Inspection Using Pandas

As a data scientist or an analyst, you’ll probably come across many file types to import and use in your Python scripts. Some analysts use Microsoft Excel, but the application limits what you can do with large data imports. The better option is pandas. It is a powerful analysis toolkit that’s much more intuitive for a data scientist.

What file formats can pandas use…?

Python can handle virtually any data file format much more than Microsoft Excel. That’s the strength of Python. It’s open-source, and there’s probably a library out there to handle it, so you get a vastly more compatible system.

These are the most common types of Data which we will come across

Comma-separated values (CSV)

XLSX

JSON

HTML

Images

DOCX

How to read and write tabular data ?

Now we will learn to read and write data using pandas functions. We will use pandas read_csv() and .to_csv() functions

A comma-separated values (CSV) file is a plaintext file with a .csv extension that holds tabular data. This is one of the most popular file formats for storing large amounts of data. Each row of the CSV file represents a single table row. The values in the same row are by default separated with commas, but you could change the separator to a semicolon, tab, space, or some other character.

Write a CSV File

You can save your Pandas DataFrame as a CSV file with .to_csv():

df.to_csv(‘data.csv’)

That’s it! You’ve created the file data.csv in your current working directory. You can expand the code block below to see how your CSV file should look:

data.csv

Read a CSV File

Once your data is saved in a CSV file, you’ll likely want to load and use it from time to time. You can do that with the Pandas read_csv() function:

df = pd.read_csv(‘data.csv’, index_col=0)

COUNTRY	POP	AREA	CONT	IND_DAY
China	1398.72	9596.96	Asia	NaN
India	1351.16	3287.26	Asia	1947-08-15
US	329.74	9833.52	N.America	1776-07-04
Indonesia	268.07	1910.93	Asia	1945-08-17

Write an Excel File

You can save your Pandas DataFrame as a CSV file with .to_excel():

df.to_csv(‘data.csv’)

Read an Excel File

You can do that with the Pandas read_excel() function:

df = pd.read_excel('data.xlsx', index_col=0)

Write an Json File

You can save your Pandas DataFrame as a CSV file with .to_json():

df.to_json('data-index.json', orient='index')

Read an Json File

You can do that with the Pandas read_json() function:

df = pd.read_json('data.xlsx', index_col=0)

Write Files

Series and DataFrame objects have methods that enable writing data and labels to the clipboard or files. They’re named with the pattern .to_<file-type>(), where <file-type> is the type of the target file.

You’ve learned about .to_csv() and .to_excel(), but there are others, including:

.to_json()

.to_html()

.to_sql()

.to_pickle()

There are still more file types that you can write to, so this list is not exhaustive.

Read Files

Pandas functions for reading the contents of files are named using the pattern .read_<file-type>(), where <file-type> indicates the type of the file to read. You’ve already seen the Pandas read_csv() and read_excel() functions. Here are a few others:

read_json()

read_html()

read_sql()

read_pickle()

These functions have a parameter that specifies the target file path. It can be any valid string that represents the path, either on a local machine or in a URL. Other objects are also acceptable depending on the file type.

How to view and inspect data in a DataFrame ?

For checking the data of pandas.DataFrame and pandas.Series with many rows and columns head() and tail() methods are useful.

Now we will use Iris Data set from kaggle for this tutorial

“ https://www.kaggle.com/uciml/iris “

import pandas as pd

df = sns.load_dataset("iris")

Get first n rows of DataFrame: head()

The head() method returns the first n rows.

print(df.head(5))

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

Get first n rows of DataFrame: tail()

The tail() method returns the first n rows.

print(df.tail(5))

	sepal_length	sepal_width	petal_length	petal_width	species
145	6.7	3.0	5.2	2.3	virginica
146	6.3	2.5	5.0	1.9	virginica
147	26.5	3.0	5.2	2.0	virginica
148	6.2	3.4	5.4	2.3	virginica
159	5.9	3.0	5.1	1.8	virginica

Pandas .shape,.size and .ndim are used to return size, shape and dimensions of data frames and series.

Create a DataFrame

import pandas as pd
import numpy as np
d={‘Name’:pd.Series(['Tom','James','Ricky','Vin',    'Steve']),'Age':pd.Series([25,26,25,23,30]),'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20])}
#Create a DataFrame 
df = pd.DataFrame(d)
print df

Output:

	Age	Name	Rating
0	25	Tom	4.23
1	26	James	3.24
2	25	Ricky	3.98
3	23	Vin	2.56
4	30	Steve	3.20

.shape Returns a tuple representing the dimensionality of the DataFrame. Tuple (a,b), where a represents the number of rows and b represents the number of columns.

df.shape

Output:

(5, 3) //5 Rows & 3 Columns

.size Returns the number of elements in the DataFrame.

df.size

Output:

21 // The total number of elements in our object is:

.ndim Returns the number of dimensions of the object. By definition, DataFrame is a 2D object

df.ndim

Output:

2 // The dimension of the object is

Pandas .info() function is used to print a concise summary of a DataFrame. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

Consider the following DataFrame df

	int_col	text_col	float_col
0	1	alpha	0.00
1	2	beta	0.25
2	3	gamma	0.50
3	4	delta	0.75
4	5	Epsilon	1.00

df.info()

Output:

<class ‘pandas.core.frame.DataFrame’>  
RangeIndex: 5 entries, 0 to 4          
Data columns (total 3 columns):        
float_col    5 non-null float64        
int_col      5 non-null int64          
text_col     5 non-null object         
dtypes: float64(1), int64(1), object(1)
memory usage: 192.0+ bytes

Pandas .describe() function computes a summary of statistics pertaining to the DataFrame columns. This function gives the mean, std and IQR values. And, function excludes the character columns and given summary about numeric columns

df.describe()

Output:

	float_col	int_col
count	5.000000	5.000000
mean	0.500000	3.000000
std	0.395285	1.581139
min	0.000000	1.000000
25%	0.250000	2.000000
50%	0.500000	3.000000
75%	0.750000	4.000000
max	1.000000	5.000000

Pandas .value_counts() function returns object containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.

Consider a DataFrame

	Student
0	Harry
1	Mike
2	Arther
3	Harry
4	Arther

Output:

Harry     2 
Arther    2 
Mike      1 
Nick      1 
Name: Student, dtype: int64

Pandas is a huge concept where we will learn all its components in a step by step manner. In the next article we will discuss about data selection, data cleaning, filtering, sorting, group-by, joining and combining of the dataset

Data Inspection Using Pandas, Data Reading, read and write tabular data

Share this article

Steven Roger

Steven Roger is a technology blogger, where he brings complex tech concepts to life with clear, engaging insights. With a passion for IT education and over a decade of industry experience, Steven specializes in demystifying the latest in software development, business analysis, and quality assurance training. His articles provide readers with practical knowledge and tips on upskilling for successful careers in tech.

All Posts