Choosing color palettes in Seaborn
Seaborn makes it easy to use colors that are well-suited to the characteristics of your data and your visualization goals. This chapter discusses both the general principles that should guide your choices and the tools in seaborn that help you quickly find the best solution for a given application.
General principles for using color in plots
Components of color
Because of the way our eyes work, a color can be defined using three components. We usually program colors in a computer by specifying their RGB values, which set the intensity of the red, green, and blue channels in a display. But for analyzing the perceptual attributes of color, it’s better to think in terms of hue, saturation, and luminance channels.
Hue is the component that distinguishes “different colors” in a non-technical sense. It’s the property of color that leads to first-order names like “red” and “blue”.
Saturation (or chroma) is the colorfulness. Two colors with different hues will look more distinct when they have more saturation.
And lightness, or luminance, corresponds to how much light is emitted (or reflected, for printed colors), ranging from black to white.
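As a minimal sketch of how the two descriptions relate, the standard-library colorsys module can convert an RGB triple into hue, lightness, and saturation. The HLS model it uses is only a rough stand-in for the perceptual attributes discussed here, and the three example colors are arbitrary choices made for illustration.

```python
import colorsys

# Illustrative RGB colors with channels in the 0-1 range (arbitrary choices).
colors = {
    "pure red": (1.0, 0.0, 0.0),
    "muted red": (0.75, 0.25, 0.25),
    "pale red": (1.0, 0.5, 0.5),
}

for name, rgb in colors.items():
    # colorsys returns the components in (hue, lightness, saturation) order.
    hue, lightness, saturation = colorsys.rgb_to_hls(*rgb)
    print(f"{name}: hue={hue:.2f}, lightness={lightness:.2f}, saturation={saturation:.2f}")
```

All three colors share the same hue. In this model the muted red differs from the pure red only in saturation, and the pale red differs from it only in lightness.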
Vary hue to distinguish categories
When you want to represent multiple categories in a plot, you will usually want to vary the color of the elements. Consider this simple example: in which of these two plots is it easier to count the number of triangular points?
In the plot on the right, the orange triangles “pop out”, making it easy to distinguish them from the circles. This pop-out effect happens because our visual system prioritizes color differences.
The blue and orange colors differ mostly in terms of their hue. Hue is useful for representing categories, because most people can distinguish a moderate number of hues relatively easily, and points that have different hues but similar brightness or intensity seem equally important. It also makes plots easier to talk about. Consider this example:
Most people would be able to quickly ascertain that there are five distinct categories in the plot on the left and, if asked to characterize the “blue” points, would be able to do so.
With the plot on the right, where the points are all blue but vary in their luminance and saturation, it’s harder to say how many unique categories are present. And how would we talk about a particular category? “The fairly-but-not-too-blue points?” What’s more, the gray dots seem to fade into the background, de-emphasizing them relative to the more intense blue dots. If the categories are equally important, this is a poor representation.
So as a general rule, use hue variation to represent categories. With that said, here are a few notes of caution. If you have more than a handful of colors in your plot, it can become difficult to keep in mind what each one means, unless there are pre-existing associations between the categories and the colors used to represent them.
This makes your plot harder to interpret: rather than focusing on the data, a viewer will have to continually refer to the legend to make sense of what is shown. So you should strive not to make plots that are too complex. And it is important to be mindful that not everyone sees colors the same way. Varying both shape (or some other attribute) and color can help people with anomalous color vision understand your plots, and it can keep them (somewhat) interpretable if they are printed in black and white.
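One way to vary shape and color together is to map the same variable to both the hue and style parameters of scatterplot(). The sketch below uses made-up random data and three hypothetical category labels, so it illustrates the encoding rather than any particular dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import seaborn as sns

# Synthetic data: three made-up categories of 30 points each.
rng = np.random.default_rng(0)
x = rng.normal(size=90)
y = rng.normal(size=90)
group = np.repeat(["a", "b", "c"], 30)

# Map the same variable to both color (hue) and marker shape (style),
# so the categories stay distinguishable even without color.
ax = sns.scatterplot(x=x, y=y, hue=group, style=group)
```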
Vary luminance to represent numbers
On the other hand, hue variations are not well suited to represent numeric data. Consider this example, where we need colors to represent the counts in a bivariate histogram.
On the left, we use a circular colormap, where gradual changes in hue correspond to gradual changes in the number of observations within each bin. On the right, we use a palette that uses brighter colors to represent bins with larger counts.
With the hue-based palette, it’s quite difficult to ascertain the shape of the bivariate distribution. In contrast, the luminance palette makes it much clearer that there are two prominent peaks.
Varying luminance helps you see structure in data, and changes in luminance are more intuitively processed as changes in importance. But the plot on the right does not use a grayscale colormap. Its colorfulness is more interesting, and the subtle hue variation increases the perceptual distance between two values, making small differences slightly easier to resolve.
These examples show that color palette choices are about more than aesthetics: the colors you choose can reveal patterns in your data if used effectively or hide them if used poorly. There is not one optimal palette, but there are palettes that are better or worse for particular datasets and visualization approaches.
And aesthetics do matter: the more that people want to look at your figures, the greater the chance that they will learn something from them. This is true even when you are making plots for yourself.
During exploratory data analysis, you may generate many similar figures. Varying the color palettes will add a sense of novelty, which keeps you engaged and prepared to notice interesting features of your data. So how can you choose color palettes that represent your data well and look attractive?
Tools for choosing color palettes
The most important function for working with color palettes is, aptly, color_palette(). This function provides an interface to most of the possible ways that one can generate color palettes in seaborn. And it’s used internally by any function that has a palette argument.
The primary argument to color_palette() is usually a string: either the name of a specific palette or the name of a family and additional arguments to select a specific member.
In the latter case, color_palette() will delegate to a more specific function, such as cubehelix_palette().
It’s also possible to pass a list of colors specified any way that matplotlib accepts (an RGB tuple, a hex code, or a name in the X11 table).
The return value is an object that wraps a list of RGB tuples with a few useful methods, such as conversion to hex codes and a rich HTML representation.
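The call styles described above can be sketched as follows. The palette names "pastel" and "husl", the cubehelix parameters, and the three colors in the explicit list are illustrative picks rather than recommendations.

```python
import seaborn as sns

# The name of a specific palette.
pastel = sns.color_palette("pastel")

# The name of a palette family plus the number of colors to draw from it.
husl = sns.color_palette("husl", 9)

# A "ch:" string, which is delegated to cubehelix_palette().
cube = sns.color_palette("ch:s=.25,rot=-.25")

# An explicit list mixing a hex code, an RGB tuple, and a color name.
custom = sns.color_palette(["#ff0000", (0.0, 0.0, 1.0), "green"])

# The returned object behaves like a list of RGB tuples
# and can be converted to hex codes.
print(len(husl))
print(custom.as_hex())
```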
Calling color_palette() with no arguments will return the current default color palette that matplotlib (and most seaborn functions) will use if colors are not otherwise specified. This default palette can be set with the corresponding set_palette() function, which calls color_palette() internally and accepts the same arguments.
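As a small sketch of that relationship, the snippet below reads the default palette, changes it with set_palette(), and reads it again. "Set2" is just an example name; note that set_palette() changes the global matplotlib defaults for the rest of the session.

```python
import seaborn as sns

# With no arguments, color_palette() reports the current default color cycle.
before = sns.color_palette()

# set_palette() accepts the same arguments as color_palette()
# and makes the result the new default.
sns.set_palette("Set2")
after = sns.color_palette()

print(before.as_hex())
print(after.as_hex())
```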
To motivate the different options that color_palette() provides, it will be useful to introduce a classification scheme for color palettes. Broadly, palettes fall into one of three categories:
- Qualitative palettes, good for representing categorical data
- Sequential palettes, good for representing numeric data
- Diverging palettes, good for representing numeric data with a categorical boundary
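To make the three categories concrete, the sketch below draws one palette of each kind and prints an approximate lightness for every color. The names "Set2", "rocket", and "vlag" are examples chosen for this illustration, and HLS lightness from the standard-library colorsys module is only a rough proxy for perceived luminance.

```python
import colorsys
import seaborn as sns

# One illustrative palette per category (names chosen for this sketch).
palettes = {
    "qualitative": sns.color_palette("Set2", 5),   # distinct hues for categories
    "sequential": sns.color_palette("rocket", 5),  # dark-to-light ramp for numbers
    "diverging": sns.color_palette("vlag", 5),     # two hues around a light midpoint
}

def lightness(rgb):
    """Approximate lightness of an RGB tuple with channels in the 0-1 range."""
    return colorsys.rgb_to_hls(*rgb)[1]

for kind, palette in palettes.items():
    print(kind, [round(lightness(color), 2) for color in palette])
```

In the sequential palette the lightness rises from one end to the other, while in the diverging palette it peaks at the midpoint, which marks the boundary between the two sides.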