{"id":5872,"date":"2020-10-21T17:01:28","date_gmt":"2020-10-21T11:31:28","guid":{"rendered":"https:\/\/www.h2kinfosys.com\/blog\/?p=5872"},"modified":"2020-10-21T17:01:30","modified_gmt":"2020-10-21T11:31:30","slug":"visualizing-categorical-data-using-seaborn","status":"publish","type":"post","link":"https:\/\/www.h2kinfosys.com\/blog\/visualizing-categorical-data-using-seaborn\/","title":{"rendered":"Visualizing categorical data using Seaborn"},"content":{"rendered":"\n<p>In the relational plot tutorial, we saw how to use different visual representations to show the relationship between multiple variables in a dataset. In the examples, we focused on cases where the main relationship was between two numerical variables. If one of the main variables is categorical (divided into discrete groups), it may be helpful to use a more specialized approach to visualization.<\/p>\n\n\n\n<p>In seaborn, there are several different ways to visualize a relationship involving categorical data. Similar to the relationship between relplot() and either scatterplot() or lineplot(), there are two ways to make these plots. 
There are several axes-level functions for plotting categorical data in different ways and a figure-level interface, catplot(), that gives unified higher-level access to them.<\/p>\n\n\n\n<p>It\u2019s helpful to think of the different categorical plot kinds as belonging to three different families, which we\u2019ll discuss in detail below.<\/p>\n\n\n\n<p><strong>Categorical scatterplots:<\/strong><\/p>\n\n\n\n<p>\u2022 stripplot() (with kind=\"strip\"; the default)<\/p>\n\n\n\n<p>\u2022 swarmplot() (with kind=\"swarm\")<\/p>\n\n\n\n<p><strong>Categorical distribution plots:<\/strong><\/p>\n\n\n\n<p>\u2022 boxplot() (with kind=\"box\")<\/p>\n\n\n\n<p>\u2022 violinplot() (with kind=\"violin\")<\/p>\n\n\n\n<p>\u2022 boxenplot() (with kind=\"boxen\")<\/p>\n\n\n\n<p><strong>Categorical estimate plots:<\/strong><\/p>\n\n\n\n<p>\u2022 pointplot() (with kind=\"point\")<\/p>\n\n\n\n<p>\u2022 barplot() (with kind=\"bar\")<\/p>\n\n\n\n<p>\u2022 countplot() (with kind=\"count\")<\/p>\n\n\n\n<p>These families represent the data using different levels of granularity. When deciding which to use, you\u2019ll have to think about the question that you want to answer.<\/p>\n\n\n\n<p>The unified API makes it easy to switch between different kinds and see your data from several perspectives. In this tutorial, we\u2019ll mostly focus on the figure-level interface, catplot(). Remember that this function is a higher-level interface to each of the functions above, so we\u2019ll reference them when we show each kind of plot, keeping the more verbose kind-specific API documentation at hand. 
We will use the tips dataset.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import seaborn as sns\nimport matplotlib.pyplot as plt\n\n# Load seaborn's example tips dataset and preview the first rows\ntips = sns.load_dataset(\"tips\")\ntips.head()<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">   total_bill   tip     sex smoker  day    time  size\n0       16.99  1.01  Female     No  Sun  Dinner     2\n1       10.34  1.66    Male     No  Sun  Dinner     3\n2       21.01  3.50    Male     No  Sun  Dinner     3\n3       23.68  3.31    Male     No  Sun  Dinner     2\n4       24.59  3.61  Female     No  Sun  Dinner     4<\/pre>\n\n\n\n<p><strong>Categorical scatterplots<\/strong><\/p>\n\n\n\n<p>The default representation of the data in catplot() uses a scatterplot. There are two different categorical scatter <a href=\"https:\/\/www.h2kinfosys.com\/blog\/introduction-to-seaborn\/\">plots in seaborn<\/a>. 
They take different approaches to resolving the main challenge in representing categorical data with a scatter plot, which is that all of the points belonging to one category would fall on the same position along the axis corresponding to the categorical variable. The approach used by stripplot(), which is the default kind in catplot(), is to adjust the positions of points on the categorical axis with a small amount of random jitter.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">sns.catplot(x=\"day\", y=\"total_bill\", data=tips)<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/BlFH9Gwo4Y7RsJ6JZN3USNPMSijsk_q7MiZcxRockP2EOrIRvD8VSy4Wei9k_gOyfE9YtOE7zniko0GySNtCWubxWKXiUPftN0VxN1uwWwVLLjMkgSe4xJZMPsd2QbzXW7s7t6Dn\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>The jitter parameter controls the magnitude of jitter or disables it altogether.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">sns.catplot(x=\"day\", y=\"total_bill\", jitter=False, data=tips)<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/WlRpp-mv-Pr-zUTB1WHFVsayu6fDI_NJX6USQgzN7GCZ5h3Y61agG--NUixVk4nHPPhJgBt0JR0NlXb-6LI-9HqdMfxYffrHcKoHdsSJKDCCouO2CPiVUH7-5X-Zf9hS7jtxOdOT\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>The second approach adjusts the points along the categorical axis using an algorithm that prevents them from overlapping. It can give a better representation of the distribution of observations, although it only works well for relatively small datasets. 
This kind of plot is sometimes called a beeswarm and is drawn in seaborn by swarmplot(), which is activated by setting kind=\"swarm\" in catplot().<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">sns.catplot(x=\"day\", y=\"total_bill\", kind=\"swarm\", data=tips)<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/7vNKfUk-7HXr7taYcBhcXyPM5-X_X5HxDrRpBGefkj63nm6g54mNDBy69XLAVDXEoHqkbpAEj9l8KSTcmqklB1U0mSfj_jktzFlXdcJIO9bVD6d9h6wxOA8eAsalA4fWRgRZVUzN\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>Similar to the relational plots, it\u2019s possible to add another dimension to a categorical plot by using a hue semantic. The categorical plots do not currently support size or style semantics.<\/p>\n\n\n\n<p>Each categorical plotting function handles the <a href=\"https:\/\/seaborn.pydata.org\/tutorial\/relational.html\" rel=\"nofollow noopener\" target=\"_blank\">hue semantic<\/a> differently. For the scatter plots, it is only necessary to change the color of the points.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">sns.catplot(x=\"day\", y=\"total_bill\", hue=\"sex\", kind=\"swarm\", data=tips)<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/MtkNANKxJThO_NHxbHUEAI4KhIX-Q8XsI91xGtRKChghmQfU7x22X0ccKbPlPs4oAHBYb-nTFjiYgsA0_c4gyQDpDXwgd5KvfPudj3dlL8GOXkDpV0zAYIiF4Ih-Pne5-vGSFFtq\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>Unlike with numerical data, it is not always obvious how to order the levels of the categorical variable along its axis. In general, the seaborn categorical plotting functions try to infer the order of categories from the data. If your data have a pandas Categorical datatype, then the default order of the categories can be set there. If the variable passed to the categorical axis looks numerical, the levels will be sorted. 
But the data are still treated as categorical and drawn at ordinal positions on the categorical axis even when numbers are used to label them.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">sns.catplot(x=\"size\", y=\"total_bill\", data=tips)<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img fetchpriority=\"high\" decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/kEEY-LNLgGCH7wocLv_ubGNutvtOYFlSPc7Gsu3ZMsDV8TlfqwPBfsSAu410jXiduCaGAor2l060C9aYA9ukMDv6yEVNrJl56tu5BjM6yYBJXaUe2lAjbdi2QvrfNeL2cueepSe_\" width=\"471\" height=\"392\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>The other option for choosing a default ordering is to take the levels of the category as they appear in the dataset. The ordering can also be controlled on a plot-specific basis using the order parameter. This can be important when drawing multiple categorical plots in the same figure, which we\u2019ll see more of below.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">sns.catplot(x=\"smoker\", y=\"tip\", order=[\"No\", \"Yes\"], data=tips)<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/Gb8HQDsrl3FtODSMILDLIuiLerCDWwR74wYMTpgjxOlHJGwCArFjPdKVF5fa0ZQeiAxWv2Ek64S5AOKnIbtzl4CDGK08NKf89Y_INDyn4929hdyLhZiEYwCurnSwpQc0AzFV6GTX\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Distributions of observations within categories<\/strong><\/h2>\n\n\n\n<p>As the size of the dataset grows, categorical scatter plots become limited in the information they can provide about the distribution of values within each category. When this happens, there are several approaches for summarizing the distributional information in ways that facilitate easy comparisons across the category levels.<\/p>\n\n\n\n<p><strong>Boxplots<\/strong><\/p>\n\n\n\n<p>The first is the familiar boxplot(). 
This kind of plot shows the three quartile values of the distribution along with extreme values. The whiskers extend to points that lie within 1.5 IQRs of the lower and upper quartiles, and observations that fall outside this range are displayed independently. This means that each value in the boxplot corresponds to an actual observation in the data.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">sns.catplot(x=\"day\", y=\"total_bill\", kind=\"box\", data=tips)<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/LMi7BZTI5jVyEMHRski-q2gkOXRpiVfu9NJHGnakCPujlT8MGtgg-XLGcC19-S2BpBpzxK-QVhYYkQ5iwoJn4F5o6ciUDWaKKuGqYRP5FgaRe3q7RAt6YtUqTMBe2OrttwIbyjVt\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>When adding a hue semantic, the box for each level of the semantic variable is moved along the categorical axis so they don\u2019t overlap.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">sns.catplot(x=\"day\", y=\"total_bill\", hue=\"smoker\", kind=\"box\", data=tips)<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/FnEbJp_6IbqmEqlwXRhqjcM0VEP6ofYGsPPn72ONCgytAmM4FGrTqTCR4NGCkqarI24Sj4qpxE69Ur-wQy8ruPTbyrI3b75Ofn4q6NhNqgwpBvDvk2TNVuEZuN_x9DlXi4SP6mcX\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>This behavior is called \u201cdodging\u201d and is turned on by default because it is assumed that the semantic variable is nested within the main categorical variable.<\/p>\n\n\n\n<p>A related function, boxenplot(), draws a plot that is similar to a box plot but optimized for showing more information about the shape of the distribution. 
It is best suited for larger datasets.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">sns.catplot(x=\"day\", y=\"total_bill\", hue=\"smoker\", kind=\"boxen\", data=tips)<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/LoHn9Wx0tm0GRaZFxJM-XrWd6xxraWWVdcDTQFq7pZikA2EuaFeSu93ioHdpIPgt1VPk9On4ggB28KUV2pgoVU9W19tDe5HeV9vwt9ITwW4e3YLMpzQInLz9Awu23KObgs825_1C\" width=\"543\" height=\"392\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p><strong>Violinplots<\/strong><\/p>\n\n\n\n<p>A violin plot plays a similar role to a box-and-whisker plot. It shows the distribution of quantitative data across several levels of one or more categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.<\/p>\n\n\n\n<p>This can be an effective and attractive way to show multiple distributions of data at once, but keep in mind that the estimation procedure is influenced by the sample size, and violins for relatively small samples might look misleadingly smooth.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">sns.catplot(x=\"day\", y=\"total_bill\", kind=\"violin\", data=tips)<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/tig6hEpn_wePnx9dJlAZsopcLNeEYt3BioZI7HV280QIdrF-0gEb2VKTvsS7aOvoZLYsjWaf9Dv22U0aeN0mUiUXxnUYAyQml3ouOwd_CCpO82BQHu47t4WFCvqcdknVSt-X46yQ\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>This approach uses the kernel density estimate to provide a richer description of the distribution of values. 
Additionally, the quartile and whisker values from the boxplot are shown inside the violin.<\/p>\n\n\n\n<p>It is also possible to \u201csplit\u201d the violins when the hue variable has only two levels, which can allow for a more efficient use of space.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">sns.catplot(x=\"day\", y=\"total_bill\", hue=\"smoker\", kind=\"violin\", split=True, data=tips)<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/skDZkttJXm6QWNmv7uWO6MOorstXDS3k0WdM2Z8pZCEb5PJoNniDo_DjPHigFzs4eZaLvgjoeMfmX8-gDZSmNOBPGYqDTpYwh_FyJLqpFnFR9EzuMfdCLE9VeivHXxddS3z96G8k\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>It can also be useful to combine swarmplot() or stripplot() with a box plot or violin plot to show each observation along with a summary of the distribution.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">g = sns.catplot(x=\"day\", y=\"total_bill\", kind=\"violin\", inner=None, data=tips)<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">sns.swarmplot(x=\"day\", y=\"total_bill\", color=\"k\", size=3, data=tips, ax=g.ax)<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/2GXYs1wLzDyFJDpBF7rxh6SyDMTiYHdrAJ0-eF6Kl7A_Y3uAvCeC5KHR70QAO-4G31lTq3nvWY_ZNkxiBEsMIsZOdOULr7UD0A30vOrkKwu6JbcM7cfs3n0drZ8TkQmRgDHym1Pf\" alt=\"\" title=\"\"><\/figure>\n\n\n\n<p>In the next article, we will learn how to plot statistical estimation within categories, along with joint and pair plots.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the relational plot tutorial, we saw how to use different visual representations to show the relationship between multiple variables in a&nbsp; dataset. In the examples, we focused on cases where the main relationship was between two numerical variables. 
If one of the main variables is categorical ( divided into discrete groups ), it may [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":5931,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":"","_members_access_role":[],"_members_access_error":""},"categories":[500],"tags":[],"class_list":["post-5872","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science-using-python-tutorials"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/5872","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/comments?post=5872"}],"version-history":[{"count":0,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/posts\/5872\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media\/5931"}],"wp:attachment":[{"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/media?parent=5872"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/categories?post=5872"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.h2kinfosys.com\/blog\/wp-json\/wp\/v2\/tags?post=5872"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}