Data Science using Python Tutorials

# Visualizing bivariate distribution using seaborn

In the previous article, all of the examples dealt with univariate distributions (distributions of a single variable), possibly conditional on a second variable assigned to hue.

Now we will assign a second variable to y, which produces a bivariate distribution. We will use the same penguins dataset here.

`sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm")`

Output :

Assigning a hue variable will plot multiple heatmaps or contour sets using different colors. For bivariate histograms, this will only work well if there is minimal overlap between the conditional distributions:

```
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species")
```

Output :

A bivariate histogram bins the data within rectangles that tile the plot and then shows the count of observations within each rectangle with the fill color. Similarly, a bivariate KDE plot smooths the (x, y) observations with a 2D Gaussian. The default representation then shows the contours of the 2D density:

```
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", kind="kde")
```

Output :
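To make the smoothing step concrete, here is a minimal NumPy sketch of a 2D Gaussian KDE evaluated on a grid. It is not seaborn's implementation: the data are synthetic, the helper `kde_2d` is a name made up for this illustration, and the single isotropic `bandwidth` is a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic (x, y) observations standing in for two measured variables.
points = rng.normal(loc=[0.0, 0.0], scale=[1.0, 0.5], size=(200, 2))

def kde_2d(points, grid_x, grid_y, bandwidth):
    """Average of isotropic 2D Gaussian kernels centred on each observation."""
    gx, gy = np.meshgrid(grid_x, grid_y, indexing="ij")
    # Squared distance from every grid node to every observation.
    d2 = (gx[..., None] - points[:, 0]) ** 2 + (gy[..., None] - points[:, 1]) ** 2
    kernels = np.exp(-d2 / (2 * bandwidth**2)) / (2 * np.pi * bandwidth**2)
    return kernels.mean(axis=-1)

grid_x = np.linspace(-6, 6, 121)
grid_y = np.linspace(-4, 4, 81)
density = kde_2d(points, grid_x, grid_y, bandwidth=0.3)

# The density should integrate to roughly 1 over the grid.
cell_area = (grid_x[1] - grid_x[0]) * (grid_y[1] - grid_y[0])
print(round(float(density.sum() * cell_area), 3))
```

A contour plot of `density` over the grid would give the kind of picture that kind="kde" produces, up to the choice of bandwidth.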

The contour approach of the bivariate KDE plot lends itself better to evaluating overlap, although a plot with too many contours can get busy:

`sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species", kind="kde")`

Output :

Just as with univariate plots, the choice of bin size or smoothing bandwidth will determine how well the plot represents the underlying bivariate distribution.

The same parameters apply, but they can be tuned for each variable by passing a pair of values:

`sns.displot(penguins, x="bill_length_mm",  y="bill_depth_mm", binwidth=(2, .5)) `

Output :
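The effect of a pair of bin widths can be reproduced by hand with `numpy.histogram2d`. The sketch below uses synthetic data and a hypothetical helper `edges_for`; the bin edges seaborn chooses may be placed differently, so only the idea of one width per axis carries over.

```python
import numpy as np

# Synthetic stand-ins for bill length and bill depth (not the real penguins data).
rng = np.random.default_rng(0)
x = rng.normal(44.0, 5.0, size=300)
y = rng.normal(17.0, 2.0, size=300)

def edges_for(values, width):
    """Bin edges of a fixed width that cover every value."""
    start = np.floor(values.min() / width) * width
    n_bins = int(np.floor((values.max() - start) / width)) + 1
    return start + width * np.arange(n_bins + 1)

# One bin width per variable, mirroring binwidth=(2, .5).
x_edges = edges_for(x, 2.0)
y_edges = edges_for(y, 0.5)

# counts[i, j] is the number of points in x-bin i and y-bin j.
counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
print(counts.shape, int(counts.sum()))
```

Each cell of `counts` corresponds to one rectangle of the heatmap, and its value is what the fill color encodes.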

To aid interpretation of the heatmap, add a colorbar to show the mapping between counts and color intensity:

`sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", binwidth=(2, .5), cbar=True)`

Output :

The meaning of the bivariate density contours is less straightforward.  Because the density is not directly interpretable, the contours are drawn at iso-proportions of the density, meaning that each curve shows a level set such that some proportion p of the density lies below it.

The p values are evenly spaced, with the lowest level controlled by the thresh parameter and the number of levels controlled by levels:

`sns.displot(penguins, x="bill_length_mm",  y="bill_depth_mm", kind="kde", thresh=.2, levels=4)  `

Output :
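The idea of iso-proportion levels can be sketched directly on a gridded density. The function below is an illustration written for this article, not seaborn's code: `iso_proportion_levels` is a hypothetical name, and spacing the p values with `np.linspace(thresh, 1, levels)` is an assumption about how "evenly spaced" is realized.

```python
import numpy as np

def iso_proportion_levels(density, thresh=0.2, levels=4):
    """Density values whose contours leave proportion p of the mass below them."""
    proportions = np.linspace(thresh, 1.0, levels)   # evenly spaced p values
    values = np.sort(density.ravel())                # ascending density values
    mass_below = np.cumsum(values) / values.sum()    # share of mass at or below each value
    idx = np.searchsorted(mass_below, proportions)
    return proportions, values[np.clip(idx, 0, values.size - 1)]

# A toy density on a grid: a single Gaussian bump.
gx, gy = np.meshgrid(np.linspace(-3, 3, 61), np.linspace(-3, 3, 61), indexing="ij")
density = np.exp(-(gx**2 + gy**2) / 2)

proportions, cutoffs = iso_proportion_levels(density)
for p, c in zip(proportions, cutoffs):
    print(f"p = {p:.2f} -> contour at density {c:.4f}")
```

Raising thresh removes the outermost contours, because the lowest curve then has to leave a larger share of the density outside it.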

The bivariate distribution histogram allows one or both variables to be discrete.  Plotting one discrete and one continuous variable offers another way to  compare conditional univariate distributions.

`sns.displot(penguins, x="species", y="body_mass_g", hue="sex")`

Output :

In contrast, plotting two discrete variables is an easy way to show the cross-tabulation of the observations:

`sns.displot(penguins, x="species", y="island")`

Output :
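The cross-tabulation that such a plot encodes is simply a count of every pair of categories. A small sketch using only the standard library, with made-up records rather than the real penguins data:

```python
from collections import Counter

# Hypothetical (species, island) observations, not the real penguins records.
observations = [
    ("Adelie", "Torgersen"), ("Adelie", "Biscoe"), ("Adelie", "Dream"),
    ("Adelie", "Dream"), ("Chinstrap", "Dream"), ("Chinstrap", "Dream"),
    ("Gentoo", "Biscoe"), ("Gentoo", "Biscoe"), ("Gentoo", "Biscoe"),
]

# Count how often each (species, island) pair occurs.
pair_counts = Counter(observations)

species = sorted({s for s, _ in observations})
islands = sorted({i for _, i in observations})

# Rows are species, columns are islands; missing pairs count as zero.
table = [[pair_counts[(s, i)] for i in islands] for s in species]

print("species".ljust(10), *(i.ljust(10) for i in islands))
for s, row in zip(species, table):
    print(s.ljust(10), *(str(n).ljust(10) for n in row))
```

Each number in `table` corresponds to one cell of the heatmap, with darker cells for larger counts.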

## Visualizing statistical relationships using seaborn

We will discuss three seaborn functions in this tutorial.

• relplot()

• scatterplot()

• lineplot()

As we will see, these functions can be quite illuminating because they use simple and easily-understood representations of data that can nevertheless represent complex dataset structures. They can do so because they plot two-dimensional graphics that can be enhanced by mapping up to three additional variables using the semantics of hue,  size, and style.

## Relating variables with scatter plots

The scatter plot is a mainstay of statistical visualization. It depicts the joint distribution of two variables using a cloud of points, where each point represents an observation in the dataset. This depiction allows the eye to infer a substantial amount of information about whether there is any meaningful relationship between them.

There are several ways to draw a scatter plot in seaborn. The most basic, which should be used when both variables are numeric, is the scatterplot() function.

In the categorical visualization tutorial, we will see specialized tools for using scatterplots to visualize categorical data. The scatterplot() is the default kind in relplot().

Here we will use the tips dataset from seaborn:

```
df = sns.load_dataset("tips")
```

Output :

`sns.relplot(x="total_bill", y="tip", data=df)`

Output :

While the points are plotted in two dimensions, another dimension can be added to the plot by coloring the points according to a third variable. In seaborn, this is referred to as using a "hue semantic", because the color of the point gains meaning:

`sns.relplot(x="total_bill", y="tip", hue="smoker", data=df)`

Output :

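A third kind of semantic variable changes the size of each point. Unlike with matplotlib.pyplot.scatter(), the literal value of the variable is not used to pick the area of the point. Instead, the range of values in data units is normalized into a range in area units. This range can be customized:

`sns.relplot(x="total_bill", y="tip", size="size", sizes=(15, 200), data=df)`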

Output :
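The normalization from data units to area units can be pictured as a linear rescaling. The sketch below is an assumption about the general idea, not seaborn's exact code; `scale_sizes` and the sample values are invented for illustration.

```python
import numpy as np

def scale_sizes(values, sizes=(15, 200)):
    """Linearly rescale data values into the marker-area range given by sizes."""
    values = np.asarray(values, dtype=float)
    lo, hi = sizes
    span = values.max() - values.min()
    # Normalize to [0, 1]; if all values are equal, put them at the bottom of the range.
    unit = (values - values.min()) / span if span > 0 else np.zeros_like(values)
    return lo + unit * (hi - lo)

party_size = [1, 2, 3, 4, 6]   # hypothetical values of the "size" column
areas = scale_sizes(party_size)
print(areas)
```

With sizes=(15, 200), the smallest value in the data gets an area of 15 and the largest gets 200, whatever their literal magnitudes are.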

## Emphasizing continuity with line plots

Scatter plots are highly effective, but there is no universally optimal type of visualization. Instead, the visual representation should be adapted for the specifics of the dataset and to the question you are trying to answer with the plot.

With some datasets, you may want to understand changes in one variable as a function of time or a similarly continuous variable. In this situation, a good choice is to draw a line plot. In seaborn, this can be accomplished by the lineplot() function, either directly or with relplot() by setting kind="line":

`df = pd.DataFrame(dict(time=np.arange(500), value=np.random.randn(500).cumsum()))`

`sns.relplot(x="time", y="value", kind="line", data=df)`

Output :

## Aggregation and representing uncertainty

More complex datasets will have multiple measurements for the same value of the x variable. The default behavior in seaborn is to aggregate the multiple measurements at each x value by plotting the mean and the 95% confidence interval around the mean. We will use the "fmri" dataset for this, loaded with `df = sns.load_dataset("fmri")`.

Output :

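| | subject | timepoint | event | region | signal |
|---|---|---|---|---|---|
| 0 | s13 | 18 | stim | parietal | -0.017552 |
| 1 | s5 | 14 | stim | parietal | -0.080883 |
| 2 | s12 | 18 | stim | parietal | -0.081033 |
| 3 | s11 | 18 | stim | parietal | -0.046134 |
| 4 | s10 | 18 | stim | parietal | -0.037970 |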

`sns.relplot(x="timepoint", y="signal", kind="line",  data=df)  `

Output :
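To see what this aggregation involves, here is a rough sketch that computes a mean and a percentile-bootstrap 95% interval at each x value. The data are synthetic, `mean_and_ci` is a hypothetical helper, and the choice of 1000 resamples is an assumption for the example rather than a statement about seaborn's internals.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic repeated measurements: 14 "subjects" at each of 5 time points.
timepoints = np.repeat(np.arange(5), 14)
signal = np.sin(timepoints / 2.0) + rng.normal(0.0, 0.3, size=timepoints.size)

def mean_and_ci(values, n_boot=1000, level=95):
    """Sample mean plus a percentile-bootstrap confidence interval for it."""
    resamples = rng.choice(values, size=(n_boot, values.size), replace=True)
    boot_means = resamples.mean(axis=1)
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return values.mean(), low, high

for t in np.unique(timepoints):
    mean, low, high = mean_and_ci(signal[timepoints == t])
    print(f"t={t}: mean={mean:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

The means trace the line, and the (low, high) pairs trace the shaded band around it.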

Another good option, especially with larger data, is to represent the spread of the distribution at each time point by plotting the standard deviation instead of a confidence interval (in seaborn 0.12 and later, the ci parameter is deprecated in favor of errorbar="sd"):

`sns.relplot(x="timepoint", y="signal", kind="line", ci="sd", data=df)`

Output :
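A standard-deviation band is simpler to compute than a confidence interval, since it needs no resampling. A small sketch with made-up measurements, using the sample standard deviation (whether seaborn uses exactly this estimator is an assumption here):

```python
import statistics

# Hypothetical repeated measurements of a signal at three time points.
measurements = {
    0: [0.10, 0.12, 0.08, 0.11],
    1: [0.30, 0.42, 0.25, 0.35],
    2: [0.20, 0.05, 0.33, 0.22],
}

bands = {}
for t, values in measurements.items():
    mean = statistics.mean(values)
    sd = statistics.stdev(values)          # sample standard deviation
    bands[t] = (mean - sd, mean + sd)      # lower and upper edge of the band
    print(f"t={t}: band from {bands[t][0]:.3f} to {bands[t][1]:.3f}")
```

Unlike a confidence interval, this band does not shrink as more observations are collected; it describes the spread of the data rather than the uncertainty of the mean.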

## Plotting subsets of data with semantic mappings

The lineplot() function has the same flexibility as scatterplot(): it can show up to three additional variables by modifying the hue, size, and style of the plot elements.

It does so using the same API as scatterplot(), meaning that we don't need to stop and think about the parameters that control the look of lines vs. points in matplotlib.

Using semantics in lineplot() will also determine how the data get aggregated. For example, adding a hue semantic with two levels splits the plot into two lines and error bands, coloring each to indicate which subset of the data they correspond to.

`sns.relplot(x="timepoint", y="signal", hue="event",  kind="line", data=df)  `

Output :
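The way a hue semantic changes the aggregation can be sketched as grouping by the hue level before averaging at each x value. The rows below are invented for illustration and only mimic the shape of the fmri data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (timepoint, event, signal) rows in the spirit of the fmri data.
rows = [
    (0, "stim", 0.10), (0, "stim", 0.20), (0, "cue", 0.00), (0, "cue", 0.10),
    (1, "stim", 0.40), (1, "stim", 0.60), (1, "cue", 0.10), (1, "cue", 0.30),
]

# Group first by the hue level, then by x, so each hue level gets its own line.
groups = defaultdict(lambda: defaultdict(list))
for timepoint, event, signal in rows:
    groups[event][timepoint].append(signal)

lines = {
    event: [(t, mean(values)) for t, values in sorted(by_time.items())]
    for event, by_time in groups.items()
}
print(lines)
```

Without the hue grouping, all four values at each time point would be averaged into a single line.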

In the next article, we will learn how to plot categorical variables using Seaborn.
