Data Cleansing Using Pandas

Data scientists spend a large amount of their time cleaning datasets and getting them down to a form with which they can work. A lot of data scientists argue that the initial steps of obtaining and cleaning data constitute 80% of the job. Therefore, if you are just stepping into this field or planning to step into this field, it is important to be able to deal with messy data, whether that means missing values, inconsistent formatting, malformed records, or nonsensical outliers.

In this article, we will cover a few pandas functions that are used to clean data.

Functions Used for Data Cleaning

After reading the dataset into a DataFrame using pd.read_csv(), we will clean the data using the functions described below.
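As a minimal sketch of that first step, the snippet below loads a small CSV into a DataFrame. The CSV text and its column names (name, age, city) are made up for illustration, and io.StringIO stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """name,age,city
Asha,31,Pune
Ben,,Leeds
Chen,27,
"""

# read_csv accepts a path or any file-like object; empty fields become NaN.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.shape)  # (rows, columns)
```

With a real file, you would pass its path to pd.read_csv() instead of the StringIO object.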

Why Should we Rename Columns and Index

If your data was generated by a computer program, it probably has some computer-generated column names, too. Those can be hard to read and understand while working, so if you want to rename a column to something more user-friendly, you can do it using df.rename().

Consider the following DataFrame:

	A	B
0	1	4
1	2	5

df.rename(columns={"A": "a", "B": "b"})

Output:

	a	b
0	1	4
1	2	5

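We can also rename the index using .rename(). Because .rename() returns a new DataFrame rather than modifying df in place, the example below assumes the renamed result from the previous step was assigned back to df.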

df.rename(index={0: "x", 1: "y"})

	a	b
x	1	4
y	2	5
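The two renames above can also be done in a single call. This is a small sketch that rebuilds the example DataFrame; the variable name renamed is just an illustrative choice. It also shows that the original DataFrame keeps its labels unless the result is assigned.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [4, 5]})

# rename() returns a new DataFrame; the original is left unchanged.
renamed = df.rename(columns={"A": "a", "B": "b"}, index={0: "x", 1: "y"})

print(list(df.columns))       # original labels are untouched
print(list(renamed.columns))
print(list(renamed.index))
```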

Missing data is always a problem in real-life scenarios. Areas like machine learning and data mining face severe issues in the accuracy of their model predictions because of the poor quality of data caused by missing values. In these areas, missing value treatment is a major point of focus to make their models more accurate and valid.

When and Why Is Data Missing?

Consider an online survey for a product. People often do not share all of the information asked of them. Some share their experience with the product but not how long they have been using it; others share how long they have used it and their experience, but not their contact information. In one way or another, some part of the data is almost always missing, and this is very common in real-world data.

In pandas, missing data is represented by two values:

None: None is a Python singleton object that is often used for missing data in Python code.

NaN: NaN (short for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE 754 floating-point representation.
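A short sketch of how the two markers behave in practice. The Series names numbers and labels are illustrative only.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is converted to NaN and the dtype becomes float64.
numbers = pd.Series([1, None, 3])

# In an object Series, None is kept as-is but is still treated as missing.
labels = pd.Series(["a", None, "c"], dtype=object)

print(numbers.dtype)
print(numbers.isna().tolist())
print(labels.isna().tolist())

# NaN never compares equal to itself, so use isna() rather than ==.
print(np.nan == np.nan)
```

This is why the detection functions shown next are preferred over equality comparisons when looking for missing values.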

Let us now see how to identify missing values in our Data Set

Consider the following Data Set

	one	two	three
a	-1.359063	1.613255	-0.669396
b	NaN	NaN	NaN
c	0.885117	0.609271	0.330818
d	NaN	NaN	NaN
e	-0.136086	1.132808	0.496091
f	0.210065	0.533174	0.111560
g	NaN	NaN	NaN
h	1.027689	0.630037	0.727022

Now we will find the missing values in the data set using the function .isnull( )

df.isnull()

Output:

	one	two	three
a	False	False	False
b	True	True	True
c	False	False	False
d	True	True	True
e	False	False	False
f	False	False	False
g	True	True	True
h	False	False	False

.isnull() checks every value in the DataFrame and returns a Boolean DataFrame of the same shape, with True where the value is missing (NaN or None) and False otherwise.

We can also use the .notnull() function to find the non-null values. It is the opposite of .isnull().

df.notnull()

Output:

	one	two	three
a	True	True	True
b	False	False	False
c	True	True	True
d	False	False	False
e	True	True	True
f	True	True	True
g	False	False	False
h	True	True	True

.notnull() checks every value in the DataFrame and returns a Boolean DataFrame of the same shape, with True for every non-null value and False for every missing value.
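The Boolean result is most useful when it is summarized. The sketch below uses a smaller frame with illustrative values shaped like the example above, and counts missing values per column and selects the rows that contain them.

```python
import numpy as np
import pandas as pd

# A small frame shaped like the example above (values are illustrative).
df = pd.DataFrame(
    {
        "one": [-1.36, np.nan, 0.89],
        "two": [1.61, np.nan, 0.61],
        "three": [-0.67, np.nan, 0.33],
    },
    index=["a", "b", "c"],
)

# True counts as 1, so summing the Boolean frame counts missing values.
missing_per_column = df.isnull().sum()

# Rows containing at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]

# Total number of missing cells in the whole frame.
total_missing = int(df.isnull().sum().sum())

print(missing_per_column)
print(rows_with_missing)
print(total_missing)
```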

How to Drop Rows with NaN Values

There are several options for handling missing values, each with its pros and cons. The right choice depends largely on the nature of the data and of the missing values. The two main options are:

Drop the missing values

Fill the missing values

Drop the missing values

The .dropna() function drops rows (by default) or columns (with axis=1) that contain null values.

Consider the following Data Set

	one	two	three
a	-1.359063	1.613255	-0.669396
b	NaN	NaN	NaN
c	0.885117	0.609271	0.330818
d	NaN	NaN	NaN
e	-0.136086	1.132808	0.496091
f	0.210065	0.533174	0.111560
g	NaN	NaN	NaN
h	1.027689	0.630037	0.727022

df.dropna()

Output:


	one	two	three
a	-1.359063	1.613255	-0.669396
c	0.885117	0.609271	0.330818
e	-0.136086	1.132808	0.496091
f	0.210065	0.533174	0.111560
h	1.027689	0.630037	0.727022
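By default, dropna() removes any row with at least one missing value, which can discard more data than intended. The sketch below, on a made-up frame, shows a few of its parameters (how, subset, thresh, axis) that narrow what gets dropped.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "one": [1.0, np.nan, 3.0, np.nan],
        "two": [4.0, np.nan, np.nan, 8.0],
        "three": [9.0, np.nan, 11.0, 12.0],
    },
    index=["a", "b", "c", "d"],
)

any_missing = df.dropna()                # drop rows with at least one NaN
all_missing = df.dropna(how="all")       # drop rows where every value is NaN
by_subset = df.dropna(subset=["three"])  # only consider column "three"
at_least_two = df.dropna(thresh=2)       # keep rows with >= 2 non-null values
no_nan_columns = df.dropna(axis=1)       # drop columns containing any NaN

print(any_missing)
print(all_missing)
```

Like rename(), dropna() returns a new DataFrame, so the result must be assigned if you want to keep it.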

Fill the missing values

The pandas df.replace() function is used to replace a string, regex, list, dictionary, Series, number, etc. in a DataFrame. It is a very rich function with many variations.

Using .replace(), we can replace every NaN with whatever value we like.

Now we will replace every NaN with 0.

df.replace(np.nan, 0)  # (original value, replacement value); requires import numpy as np

Output:

	one	two	three
a	-1.359063	1.613255	-0.669396
b	0.000000	0.000000	0.000000
c	0.885117	0.609271	0.330818
d	0.000000	0.000000	0.000000
e	-0.136086	1.132808	0.496091
f	0.210065	0.533174	0.111560
g	0.000000	0.000000	0.000000
h	1.027689	0.630037	0.727022
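pandas also provides fillna(), which is dedicated to filling missing values. The sketch below, on an illustrative two-column frame, shows that it gives the same result as the replace() call above, and that it can take a different fill value per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.5, np.nan], "two": [np.nan, 2.5]}, index=["a", "b"])

# Fill every NaN with 0; equivalent to df.replace(np.nan, 0) here.
zero_filled = df.fillna(0)
replaced = df.replace(np.nan, 0)

# A dictionary gives a different fill value for each column.
per_column = df.fillna({"one": -1.0, "two": 99.0})

print(zero_filled)
print(per_column)
```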

You can also replace NaN with the mean, median, or mode of the column, or impute values using machine learning models.
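A minimal sketch of the statistical fills, using a made-up Series. Model-based imputation is not shown here.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 2.0, np.nan, 10.0])

# mean(), median() and mode() all ignore NaN when computing the statistic.
mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())

# mode() returns a Series because there can be ties; take the first value.
mode_filled = s.fillna(s.mode().iloc[0])

print(mean_filled.tolist())
print(median_filled.tolist())
print(mode_filled.tolist())
```

The mean is pulled upward by the outlier 10.0, while the median and mode are not, which is one reason the choice of statistic depends on the data.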

We will learn about statistical summary calculations in the next article.

Now we will learn how to convert the datatypes of the variables

When doing data analysis, it is important to make sure you are using the correct data types; otherwise, you may get unexpected results or errors. In the case of pandas, it will correctly infer data types in many cases and you can move on with your analysis without any further thought on the topic.

Despite how well pandas works, at some point in your data analysis processes, you will likely need to explicitly convert data from one type to another.

Now we will discuss the basic pandas data types, how they map to Python and NumPy data types, and the options for converting from one pandas type to another.

Pandas Data Types

A data type is essentially an internal construct that a programming language uses to understand how to store and manipulate data. For instance, a program needs to understand that you can add two numbers together like 5 + 10 to get 15. Or, if you have two strings such as “cat” and “dog” you could concatenate (add) them together to get “catdog.”

A possibly confusing point about pandas data types is that there is some overlap between pandas and NumPy. This table summarizes the key points:

Pandas dtype	NumPy type	Usage
object	string_, unicode_, mixed types	Text or mixed numeric
int64	int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64	Integer numbers
float64	float_, float16, float32, float64	Floating point numbers
bool	bool_	True/False values
datetime64	datetime64[ns]	Date and time values
category	NA	Finite list of text values
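To see these dtypes in practice, the sketch below builds a DataFrame with one column of each kind and prints df.dtypes. The column names and values are invented for illustration. Text columns have traditionally been reported as object; the exact label for text and datetime columns may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Krishna", "Ram"],                              # text
        "visits": [3, 7],                                        # integers
        "height": [5.40, 5.11],                                  # floats
        "active": [True, False],                                 # booleans
        "joined": pd.to_datetime(["2021-01-05", "2022-03-17"]),  # dates
        "tier": pd.Categorical(["gold", "silver"]),              # categories
    }
)

# One dtype per column.
print(df.dtypes)
```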

Now we will focus on the following pandas data types and learn how to convert them from one form to another form

object

int64

float64

Consider a DataFrame

	Age	Height	Weight
Krishna	25.5	5.40	45.5
Ram	45.0	5.11	50.0

Age is usually recorded as a whole number, but if we look closely at the DataFrame, the Age values are stored as floats.

We can check the datatypes of every column by using .dtypes

df.dtypes

Age       float64
Height    float64
Weight    float64
dtype: object

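The output shows that Age is stored as a float. To convert it to an integer type, we use .astype():

df['Age'].astype('int32')

Krishna    25
Ram        45
Name: Age, dtype: int32

The returned Series has an integer dtype, and the fractional part of 25.5 has been truncated. Because .astype() returns a new Series, assign the result back to df['Age'] to change the DataFrame itself.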
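The conversion can be sketched end to end as follows. The first part rebuilds the example DataFrame and assigns the converted column back. The second part is an addition not covered above: pd.to_numeric, which handles the common case of numbers stored as text (the raw Series is made up for illustration).

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [25.5, 45.0], "Height": [5.40, 5.11], "Weight": [45.5, 50.0]},
    index=["Krishna", "Ram"],
)

# astype() returns a new Series; assign it back to change the DataFrame.
# Converting float to int truncates the fractional part (25.5 becomes 25).
df["Age"] = df["Age"].astype("int32")
print(df.dtypes)

# Text that holds numbers can be converted with pd.to_numeric;
# errors="coerce" turns unparseable entries into NaN instead of raising.
raw = pd.Series(["10", "20", "n/a"])
parsed = pd.to_numeric(raw, errors="coerce")
print(parsed)
```

Note that astype() to an integer type raises an error if the column still contains NaN, so missing values should be handled first.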

We will see the rest of the pandas applications in the next article.


Steven Roger

Steven Roger is a technology blogger, where he brings complex tech concepts to life with clear, engaging insights. With a passion for IT education and over a decade of industry experience, Steven specializes in demystifying the latest in software development, business analysis, and quality assurance training. His articles provide readers with practical knowledge and tips on upskilling for successful careers in tech.
