There is just something extraordinary about a well-designed visualization. The colors stand out, the layers blend nicely together, the contours flow throughout, and the overall package not only looks good but also gives us meaningful insights. This matters in data science, where we often work with a lot of messy data, so the ability to visualize it is critical for a data scientist. Our stakeholders or clients will more often than not rely on visual cues rather than the intricacies of a machine learning model. There are plenty of excellent Python visualization libraries available, including the widely used Matplotlib (a third-party package, not part of the standard library). Matplotlib has proven to be an incredibly useful and popular visualization tool, but even avid users will admit it often leaves much to be desired. One answer to these shortcomings is Seaborn.
What is Seaborn?
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Seaborn provides simple high-level functions for common statistical plot types and integrates with the functionality provided by Pandas DataFrames.
Why should we use Seaborn over matplotlib?
• Matplotlib functions don’t work as naturally with DataFrames as Seaborn’s functions do
• Seaborn comes with a large number of high-level interfaces and ready-made themes, whereas in Matplotlib it’s not easy to figure out the settings that make plots attractive
• Matplotlib’s API is relatively low level. Doing sophisticated statistical visualization is possible, but often requires a lot of boilerplate code.
• Matplotlib predates Pandas by several years and thus was not designed around Pandas DataFrames. To visualize data from a Pandas DataFrame, you often have to extract each Series and combine them in the right format. It would be nicer to have a plotting library that can intelligently use the DataFrame labels in a plot.
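The difference in how the two libraries handle a DataFrame can be sketched with a small comparison. This is a minimal illustration, not a benchmark: the data and the column names (hours, score, group) are made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up illustrative data; the column names are assumptions for this sketch.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 1, 2, 3, 4],
    "score": [52, 58, 65, 71, 48, 61, 70, 80],
    "group": ["a", "a", "a", "a", "b", "b", "b", "b"],
})

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(10, 4))

# Matplotlib: split the frame by group, plot each subset, and label the axes by hand.
for name, subset in df.groupby("group"):
    ax_mpl.scatter(subset["hours"], subset["score"], label=name)
ax_mpl.set_xlabel("hours")
ax_mpl.set_ylabel("score")
ax_mpl.legend(title="group")

# Seaborn: pass the DataFrame and refer to columns by name;
# axis labels and the legend are taken from the column labels.
sns.scatterplot(data=df, x="hours", y="score", hue="group", ax=ax_sns)

# In an interactive session, call plt.show() to display the figure.
```

Both panels show the same points; the Seaborn call simply reads the grouping and labels from the DataFrame.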
How to install Seaborn
To install Seaborn and use it effectively, we first need its dependencies. The following are the four mandatory dependencies you need to have
- NumPy (version >= 1.9.3)
- SciPy (version >= 0.14.0)
- matplotlib (version >= 1.4.3)
- Pandas (version >= 0.15.2)
Some optional dependencies are also useful
• Statsmodels, for advanced regression plots
• Fastcluster, for clustering large matrices
Once this step is done, we are all set to install Seaborn and enjoy its mesmerizing plots. To install the latest release of Seaborn, you can use pip
pip install seaborn
You can also use conda to install the latest version of Seaborn
conda install seaborn
To import the dependencies and Seaborn itself in your code, you can use the following code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
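To confirm that the installation worked, one option is to print the installed version of each package. The sketch below uses only the standard library; the helper name installed_version is hypothetical, and the package lists simply mirror the dependencies named above.

```python
from importlib import metadata

# Package names as published on PyPI; these lists mirror the dependencies above.
required = ["numpy", "scipy", "matplotlib", "pandas", "seaborn"]
optional = ["statsmodels", "fastcluster"]


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for package in required + optional:
    version = installed_version(package)
    status = version if version is not None else "not installed"
    print(f"{package}: {status}")
```

Any package reported as "not installed" can then be added with pip or conda before continuing.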
Data Visualization using Seaborn
The following is the list of plot types we can draw using Seaborn
• Relational plots
• Distribution plots
• Categorical plots
• Regression plots
• Matrix plots
• Pair plots & Joint plots
Relational plots
Statistical analysis is a process of understanding how variables in a dataset relate to each other and how those relationships depend on other variables. Visualization can be a core component of this process because, when data are visualized properly, the human visual system can see trends and patterns that indicate a relationship. The function we will use most is relplot(). This is a figure-level function for visualizing statistical relationships using two common approaches: scatter plots and line plots.
- scatterplot() ## Draw a scatter plot with the possibility of several semantic groupings.
- lineplot() ## Draw a line plot with the possibility of several semantic groupings.
Distribution plots
Whenever we are dealing with a dataset, we want to know how the data or the variables are distributed.
displot() is the figure-level function for visualizing distributions. The distribution of the data can tell us a lot about its nature.
- distplot() ## Flexibly plot a univariate distribution of observations (deprecated since Seaborn 0.11 in favor of displot() and histplot()).
- histplot() ## Plot univariate or bivariate histograms to show distributions of datasets.
- kdeplot() ## Plot univariate or bivariate distributions using kernel density estimation.
- ecdfplot() ## Plot empirical cumulative distribution functions.
- rugplot() ## Plot marginal distributions by drawing ticks along the x and y axes.
Categorical plots
In this section, we’ll look at the relationship between two variables, one of which is categorical and divides the data into groups. catplot() is Seaborn’s figure-level function for drawing plots of categorical data.
- barplot() ## Show point estimates and confidence intervals as rectangular bars.
- countplot() ## Show the counts of observations in each categorical bin using bars.
- boxplot() ## Draw a box plot to show distributions with respect to categories.
- violinplot() ## Combination of boxplot and kernel density estimate.
- pointplot() ## Show point estimates and confidence intervals using scatter plot glyphs.
- swarmplot() ## Draw a categorical scatterplot with non-overlapping points.
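A few of the categorical functions above can be sketched on one invented dataset. The day labels and bill amounts below are synthetic and only loosely imitate restaurant data; they are not taken from any real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2)

# Synthetic data: 25 made-up bill amounts for each of four days.
cat_df = pd.DataFrame({
    "day": np.repeat(["Thu", "Fri", "Sat", "Sun"], 25),
    "bill": rng.gamma(shape=4.0, scale=5.0, size=100),
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Mean bill per day with a confidence interval.
sns.barplot(data=cat_df, x="day", y="bill", ax=axes[0])

# Distribution of bills per day as a box plot.
sns.boxplot(data=cat_df, x="day", y="bill", ax=axes[1])

# Number of observations per day.
sns.countplot(data=cat_df, x="day", ax=axes[2])

# Figure-level interface: the kind argument selects the axes-level function.
violin_grid = sns.catplot(data=cat_df, x="day", y="bill", kind="violin")
```

Changing kind to "box", "bar", "point", or "swarm" in the catplot() call draws the other plot types without changing the rest of the code.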
Regression plots
The regression plots in Seaborn are primarily intended to add a visual guide that helps to emphasize patterns in a dataset during exploratory data analysis. As the name suggests, a regression plot fits a regression line between two variables and helps to visualize their linear relationship.
- regplot() ## Plot data and a linear regression model fit.
- residplot() ## Plot the residuals of linear regression.
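The two regression functions can be illustrated on synthetic data with a known linear trend. In this sketch the true relationship y = 2x + 1 plus noise is an assumption chosen for the example, and np.polyfit is used only as an independent check of the slope that regplot() draws.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(3)

# Synthetic data with a known linear relationship: y = 2x + 1 + noise.
x = rng.uniform(0, 10, size=80)
reg_df = pd.DataFrame({"x": x, "y": 2.0 * x + 1.0 + rng.normal(0, 1, size=80)})

fig, (ax_fit, ax_resid) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot with a fitted regression line and a confidence band.
sns.regplot(data=reg_df, x="x", y="y", ax=ax_fit)

# Residuals of the linear fit; a pattern-free cloud suggests a linear model is adequate.
sns.residplot(data=reg_df, x="x", y="y", ax=ax_resid)

# Independent least-squares fit, to see the numbers behind the drawn line.
slope, intercept = np.polyfit(reg_df["x"], reg_df["y"], deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

Seaborn draws the fit but does not return its coefficients, which is why a separate fit is needed when the numeric values matter.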
Matrix plots
A matrix plot is a color-coded diagram of matrix data: it has rows, columns, and a color for each value.
- heatmap() ## Plot rectangular data as a color-encoded matrix.
- clustermap() ## Plot a matrix dataset as a hierarchically clustered heatmap.
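A common use of heatmap() is to display a correlation matrix. The sketch below builds one from synthetic columns; the column names (a to d) and the way they are generated are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(4)

# Synthetic data: "b" is built to correlate with "a", and "d" with "c".
base = rng.normal(size=(100, 2))
mat_df = pd.DataFrame({
    "a": base[:, 0],
    "b": base[:, 0] + rng.normal(scale=0.5, size=100),
    "c": base[:, 1],
    "d": -base[:, 1] + rng.normal(scale=0.5, size=100),
})

# Pairwise Pearson correlations: a 4 x 4 matrix with ones on the diagonal.
corr = mat_df.corr()

# Color-encode the matrix and print each value inside its cell.
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
```

Passing the same corr matrix to sns.clustermap() would additionally reorder rows and columns by hierarchical clustering; that function relies on SciPy being installed.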
Pair plots & Joint plots
We can also plot multiple bivariate distributions in a dataset by using the Seaborn library. A pair plot shows the relationship between each pair of columns in the dataset and draws the univariate distribution of each variable on the diagonal axes.
- pairplot() ## Plot pairwise relationships in a dataset.
- jointplot() ## Draw a plot of two variables with bivariate and univariate graphs.
Using all these plots we can perform data analysis and draw meaningful conclusions. We can also quickly see trends and outliers. If we can see something, we internalize it quickly.
Seaborn gives access to a few useful example datasets. They are not bundled with the installation: each dataset is downloaded from an online repository the first time you load it (so an internet connection is needed) and is then cached locally. You can use any of these datasets for your learning. With the help of the following function you can load the required dataset
import seaborn as sns
sns.load_dataset(name, cache=True, **kws)
name : name of the dataset (name.csv on https://github.com/mwaskom/seaborn-data).
cache : boolean, optional
    If True, then cache data locally and use the cache on subsequent calls
kws : dict, optional
    Passed to pandas.read_csv
To view all the available datasets in the Seaborn library, you can use the get_dataset_names() function as shown below
sns.get_dataset_names()
['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes', 'diamonds', 'dots', 'exercise', 'flights', 'fmri', 'gammas', 'geyser', 'iris', 'mpg', 'penguins', 'planets', 'tips', 'titanic']
This is the list of datasets available in Seaborn. We can use these datasets for practice.
Now we will learn how to import these datasets.
Importing Data as Pandas DataFrame
Now, we will import a dataset. The dataset loads as a Pandas DataFrame by default, so any Pandas DataFrame method works on it.
import seaborn as sns
df = sns.load_dataset('tips')
df.head()
Here we imported the built-in tips dataset from Seaborn.
The call to df.head() displays the first five rows of the DataFrame.
In the next article, we will learn how to visualize all the seaborn plots.