{"id":5733,"date":"2020-10-19T16:48:27","date_gmt":"2020-10-19T11:18:27","guid":{"rendered":"https:\/\/www.h2kinfosys.com\/blog\/?p=5733"},"modified":"2020-10-19T16:48:29","modified_gmt":"2020-10-19T11:18:29","slug":"visualizing-univariate-distributions-using-seaborn","status":"publish","type":"post","link":"https:\/\/www.h2kinfosys.com\/blog\/visualizing-univariate-distributions-using-seaborn\/","title":{"rendered":"Visualizing univariate distributions using seaborn"},"content":{"rendered":"\n<p>An early step in any effort to analyze data should be to understand how the variables are distributed. Techniques for distribution visualization can provide quick answers to many important questions such as.&nbsp;&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>What range do the observations cover?&nbsp;<\/li><li>What is their central tendency?&nbsp;<\/li><li>Are they heavily skewed in one direction?&nbsp;&nbsp;<\/li><li>Is there evidence for bimodality?&nbsp;<\/li><li>Are there significant outliers?<\/li><li>Do the answers to these questions vary across subsets defined by other variables?&nbsp;<\/li><\/ul>\n\n\n\n<p>There are several distribution plots designed to answer all these questions such as these.&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>histplot()&nbsp;&nbsp;<\/li><li>displot()&nbsp;&nbsp;<\/li><li>kdeplot()&nbsp;&nbsp;<\/li><li>ecdfplot()&nbsp;&nbsp;<\/li><li>rugplot()&nbsp;<\/li><\/ul>\n\n\n\n<p>There are several different approaches to visualizing a distribution, and each has its relative advantages and drawbacks. It is important to understand these factors so that you can choose the best approach for your particular aim.&nbsp;<\/p>\n\n\n\n<p>Now we try to plot all these plots and perform data analysis. In order to perform this analysis, we will use the <a href=\"https:\/\/www.h2kinfosys.com\/blog\/introduction-to-seaborn\/\">seaborn <\/a>load_dataset()\u00a0 function and use it to build a dataset for our analysis.\u00a0<\/p>\n\n\n\n<p>Now we will import all the necessary libraries<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd  \nimport numpy as np  \nimport seaborn as sns  \ndf = sns.load_dataset(\"penguins\")  df.head()  <\/code><\/pre>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><\/td><td><strong>species<\/strong><\/td><td><strong>island<\/strong><\/td><td><strong>bill_length_mm<\/strong><\/td><td><strong>bill_depth_mm<\/strong><\/td><td><strong>flipper_length_mm<\/strong><\/td><td><strong>body_mass_g<\/strong><\/td><td><strong>sex<\/strong><\/td><\/tr><tr><td><strong>0<\/strong><\/td><td><strong>Adelie<\/strong><\/td><td><strong>Torgersen<\/strong><\/td><td><strong>39.1<\/strong><\/td><td><strong>18.7<\/strong><\/td><td><strong>181.0<\/strong><\/td><td><strong>3750.0<\/strong><\/td><td><strong>Male<\/strong><\/td><\/tr><tr><td><strong>1<\/strong><\/td><td><strong>Adelie<\/strong><\/td><td><strong>Torgersen<\/strong><\/td><td><strong>39.5<\/strong><\/td><td><strong>17.4<\/strong><\/td><td><strong>186.0<\/strong><\/td><td><strong>3800.0<\/strong><\/td><td><strong>Female<\/strong><\/td><\/tr><tr><td><strong>2<\/strong><\/td><td><strong>Adelie<\/strong><\/td><td><strong>Torgersen<\/strong><\/td><td><strong>40.3<\/strong><\/td><td><strong>18.0<\/strong><\/td><td><strong>195.0<\/strong><\/td><td><strong>3250.0<\/strong><\/td><td><strong>Female<\/strong><\/td><\/tr><tr><td><strong>4<\/strong><\/td><td><strong>Adelie<\/strong><\/td><td><strong>Torgersen<\/strong><\/td><td><strong>36.7<\/strong><\/td><td><strong>19.3<\/strong><\/td><td><strong>193.0<\/strong><\/td><td><strong>3450.0<\/strong><\/td><td><strong>Female<\/strong><\/td><\/tr><tr><td><strong>5<\/strong><\/td><td><strong>Adelie<\/strong><\/td><td><strong>Torgersen<\/strong><\/td><td><strong>39.3<\/strong><\/td><td><strong>20.6<\/strong><\/td><td><strong>190.0<\/strong><\/td><td><strong>3650.0<\/strong><\/td><td><strong>Male<\/strong><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><strong>Histograms&nbsp;&nbsp;<\/strong><\/p>\n\n\n\n<p>Perhaps the most common approach to visualizing a distribution is the histogram. This is the default approach in <strong>displot()<\/strong>, which uses the same underlying code as <strong>histplot()<\/strong>.&nbsp;&nbsp;<\/p>\n\n\n\n<p>A histogram is a bar plot where the axis representing the data variable&nbsp; is divided into a set of discrete bins and the count of observations&nbsp; falling within each bin is shown using the height of the corresponding&nbsp;&nbsp;<\/p>\n\n\n\n<p><code>sns.displot(df, x=\u201cflipper_length_mm\")<\/code><\/p>\n\n\n\n<p>or<\/p>\n\n\n\n<p><code>sns.histplot(df, x=\u201cflipper_length_mm\")<\/code><\/p>\n\n\n\n<p><strong>Output:&nbsp;<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/MLc0A8ZDHynaYDlaEtPqxU6eKO2PHXVI2DsfE_Aqooty6FXB-tXR-jSzn_lN2PpGNOyUbuCejU-25gGdiNNc4dQ6z_WjoYnN1k9w5sCaXabIyxDZOsC_CGNaLBKW4DhT-s5lJC4JMOiGb6In7A\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>This plot immediately affords a few insights about the&nbsp; flipper_length_mm variable. For instance, we can see that the most common flipper length is about 195 mm, but the distribution appears bimodal, so this one number does not represent the data well.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Choosing the bin size&nbsp;&nbsp;<\/strong><\/h2>\n\n\n\n<p>The size of the bins is an important parameter, and using the wrong bin size can mislead by obscuring important features of the data or by creating apparent features out of random variability. By default,&nbsp; displot() \/ histplot() choose a default bin size based on the variance of the data and the number of observations.&nbsp;&nbsp;<\/p>\n\n\n\n<p>But you should not be over-reliant on such automatic approaches,&nbsp; because they depend on particular assumptions about the structure of your data.&nbsp;&nbsp;<\/p>\n\n\n\n<p>It is always advisable to check that your impressions of the distribution are consistent across different bin sizes. To choose the size directly, set&nbsp; the bin-width parameter&nbsp;<\/p>\n\n\n\n<p><code>sns.displot(df, x=\"flipper_length_mm\",bins=30)&nbsp;<\/code><\/p>\n\n\n\n<p><strong>Output:&nbsp;<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong><\/strong><strong>Conditioning on other variables&nbsp;&nbsp;<\/strong><\/h2>\n\n\n\n<p>Once you understand the distribution of a variable, the next step is often to ask whether features of that distribution differ across other variables in the dataset. For example, what accounts for the bimodal distribution of flipper lengths that we saw above?&nbsp;&nbsp;<\/p>\n\n\n\n<p>displot() and histplot() provide support for conditional subsetting via the hue semantic. Assigning a variable to a hue will draw a&nbsp; separate histogram for each of its unique values and distinguish them&nbsp; by color&nbsp;<\/p>\n\n\n\n<p><code>sns.displot(df, x=\u201cflipper_length_mm\u201d,hue='species')&nbsp; <\/code>or&nbsp;<code>sns.histplot(df, x=\u201cflipper_length_mm\u201d,hue='species')&nbsp;<\/code><\/p>\n\n\n\n<p><strong>Output:&nbsp;<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong><\/strong><strong>KDE plot ( Kernel density estimation )&nbsp;&nbsp;<\/strong><\/h2>\n\n\n\n<p>A histogram aims to approximate the underlying probability density function that generated the data by binning and counting observations.&nbsp; Kernel density estimation (KDE) presents a different solution to the same problem. Rather than using discrete bins, a KDE plot smooths the&nbsp; observations with a Gaussian kernel, producing a continuous density&nbsp; estimate&nbsp;<\/p>\n\n\n\n<p><code>sns.displot(df, x=\"flipper_length_mm\", kind=\u201ckde\")<\/code><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/auBmycaIDJWL_uIXx8XrMvFdkrtlUxYF-wbwARBmQ7yaWornttAC9BzhO6KgdFnYV2HzgmqnzQFmhrYcg9Ysj1-yk7HdsjN52faS08YIvrXpfLfGFJFqB_J-rwQf6mXluyCLzKaHG9eSSitK7Q\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Choosing the smoothing bandwidth&nbsp;&nbsp;<\/strong><\/h2>\n\n\n\n<p>Much like with the bin size in the histogram, the ability of the KDE to accurately represent the data depends on the choice of smoothing bandwidth. An over-smoothed estimate might erase meaningful features, but an under-smoothed estimate can obscure the true shape within random noise. The easiest way to check the robustness of the&nbsp; estimate is to adjust the default bandwidth&nbsp;<\/p>\n\n\n\n<p><code>sns.displot(df, x=\u201cflipper_length_mm\", kind=\u201ckde\",&nbsp; bw_adjust=.25)&nbsp;&nbsp;<\/code><\/p>\n\n\n\n<p><strong>Output:&nbsp;&nbsp;<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/Fpwtp9ly4V9P8AXx5FCTjTSR7aaxv_BBbNNTjkYQmGS_cLDA0gxljDxh78Jn6rWPF8aTj41XwWZUN-yNB0FeAnZ10ycgSFhd9dp6_QlrxC7JiDZ8SjP11Db1zdEF9q2wwlXKcuXsiDdY8bylAA\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Conditioning on other variables&nbsp;&nbsp;<\/strong><\/h3>\n\n\n\n<p>As with histograms, if you assign a hue variable, a separate density&nbsp; estimate will be computed for each level of that variable<\/p>\n\n\n\n<p><code>sns.displot(df, x=\u201cflipper_length_mm\u201d, kind=\u201ckde\",&nbsp; hue=\u2018species\u2019)&nbsp;&nbsp;<\/code><\/p>\n\n\n\n<p><strong>Output:&nbsp;<\/strong><strong><br><\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/aaX7STDqjddinglKXRdi0ZiA5XaPzJ5URtWnCNquPpjb_EiFDtvxVyfuVSoo-sicf_vZLpVXzHPtx_qTdJVDfl8cxdyWtz5g6KwVu_Fvku8FDWTJSbHrIdtUaHhgdfELV1C05QZ54KEHoc_FzQ\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>In many cases, the layered KDE is easier to interpret than the layered\u00a0 histogram, so it is often a good choice for the task of comparison\u00a0As a compromise, it is possible to combine these two approaches.\u00a0 While in <a href=\"https:\/\/en.wikipedia.org\/wiki\/Histogram\" rel=\"nofollow noopener\" target=\"_blank\">histogram mode<\/a>, <strong>displot() <\/strong>also with <strong>histplot() <\/strong>has the\u00a0 option of including the smoothed KDE curve (note kde=True, not\u00a0 kind=\u201ckde\u201d)\u00a0<\/p>\n\n\n\n<p><code>sns.displot(df, x=\u201cflipper_length_mm\",kde=True)<\/code><\/p>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/KmQ5ph9lzgGXYQ3xhDPxcAgU-MgyFq0VfwSTEFgvVWhox8XBl8HKDpkQ2fBk3ROaziX14jkfjfEso2BA8JlTHAr79Savg4SYCHtnLg46X0dSHnkJpj6TyDOSp-SfxAd7aV_7Gev38-85LFNAjA\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>ECDF plots ( \u201cempirical cumulative distribution function)&nbsp;&nbsp;<\/strong><\/h2>\n\n\n\n<p>A third option for visualizing distributions computes the \u201cempirical cumulative distribution function\u201d (ECDF). This plot draws a&nbsp; monotonically-increasing curve through each data point such that the&nbsp; height of the curve reflects the proportion of observations with a smaller&nbsp; value:&nbsp;<\/p>\n\n\n\n<p><code>sns.displot(penguins, x=\u201cflipper_length_mm\u201d,&nbsp; kind=\"ecdf\")&nbsp;<\/code><\/p>\n\n\n\n<p>or&nbsp;<\/p>\n\n\n\n<p><code>sns.ecdfplot(penguins, x=\u201cflipper_length_mm\u201d)<\/code><\/p>\n\n\n\n<p><strong>Output:&nbsp;<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/0ozjI0tzZinS5z5KZTQcssu2P8otyHzGnKD1tFcF4O5l22wNfjdgoX4xsEd65zeos0U2c4z0Gv_4qm3mnxwKwlSZ1_Rp6SVbhhoEoPjsxSJgSnow4fck3PA1BLJwUrLcNI22zj8Ri0d9ZW-bjw\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>The ECDF plot has two key advantages. Unlike the histogram or KDE, it directly represents each data point. That means there is no bin size or smoothing parameter to consider. Additionally, because the curve is&nbsp; monotonically increasing, it is well-suited for comparing multiple&nbsp; distributions<\/p>\n\n\n\n<p><code>sns.displot(penguins, x=\u201cflipper_length_mm\u201d,&nbsp; kind=\u201cecdf\", hue=\u201cspecies\")&nbsp;&nbsp;<\/code><\/p>\n\n\n\n<p><strong>Output:&nbsp;&nbsp;<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/UEQG8eit2xqvwD7sMTdaiJnSSCsh80peiF6QPOyL7C7QJgfzmnOptA61hs-6hphBuUiBr63wB1c4aPrxogyhAFb_dQ-XhzqjNpWvuhjq9Sv5lsxXVeifJM7PYsj2WPqsCf_k_qGALKQL1AOo9g\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>The major downside to the ECDF plot is that it represents the shape of the distribution less intuitively than a histogram or density curve.&nbsp; Consider how the bimodality of flipper lengths is immediately apparent in the histogram, but to see it in the ECDF plot, you must look for varying slopes. Nevertheless, with practice, you can learn to answer all of the important questions about distribution by examining the ECDF,&nbsp; and doing so can be a powerful approach.&nbsp;<\/p>\n\n\n\n<p><strong>Rug plot&nbsp;&nbsp;<\/strong><\/p>\n\n\n\n<p>The rug is not a separate plot. It is a one-dimensional display that you can add to existing plots to illuminate information that is sometimes lost in other types of graphs. Like a strip plot, it represents values of a variable by putting a symbol at various points along an axis.&nbsp; However, it uses short lines to represent points. You can place it at the&nbsp; bottom (the default) or top of a graph (side = 3)<\/p>\n\n\n\n<p><code>sns.displot(df,x=\u201cflipper_length_mm\u201d,kind=\u201ckde\u201d,&nbsp; rug=True)&nbsp;&nbsp;<\/code><\/p>\n\n\n\n<p><strong>Output:&nbsp;<\/strong><\/p>\n\n\n\n<p><strong><\/strong>In the next article we will learn how to visualize bivariate distributions<\/p>\n","protected":false},"excerpt":{"rendered":"<p>An early step in any effort to analyze data should be to understand how the variables are distributed. Techniques for distribution visualization can provide quick answers to many important questions such as.&nbsp;&nbsp; What range do the observations cover?&nbsp; What is their central tendency?&nbsp; Are they heavily skewed in one direction?&nbsp;&nbsp; Is there evidence for bimodality?&nbsp; [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":5821,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[500],"tags":[],"class_list":["post-5733","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science-using-python-tutorials"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/5733","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/comments?post=5733"}],"version-history":[{"count":0,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/5733\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media\/5821"}],"wp:attachment":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media?parent=5733"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/categories?post=5733"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/tags?post=5733"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}