Abstract: "This is a reformatted version of two articles I published in 2015, covering color palettes and accessibility for color-blind people. It features Pi (π) as an infinite source of numbers between 0 and 9, visualizing patterns (or the absence thereof), selecting palettes, and some trade-offs of color selections. Also included is a link to: The Hitchhikers Guide to the Open Source Data Science Galaxy"
Picking up patterns
Or the lack thereof. The brain is pretty good at spotting patterns and anomalies. But we have to help it with something that can be easily abstracted. Numbers are not good for that.
Of reds and greens and blues
Colors are good helpers. Shades, hues. Unfortunately, many people are affected by colorblindness. It is said that in some segments of the population, up to 8% of men and 0.4% of women experience congenital color deficiency, with the most common being red-green color blindness.
In this article, we will look at one particularly problematic area of color selection, that of categorical variables. Here, we will use numbers 0 to 9 as categorical, even though we might not typically think of numbers as categorical. We might think of them as ordinal if they are in a relative position (first, second, third), or cardinal (one, two, three), for discrete numbers (integers).
However, in this case we just want to see patterns, so we consider them as categorical, their rank having no impact on the purpose of our visualization. If we had continuous numbers (real or float approximations) this would not be possible, and we would simply use a continuous palette (single color, diverging colors, even multidimensional).
We'll go through this as a Jupyter Notebook, using the statistical visualization package seaborn, and the classic matplotlib:
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import seaborn as sns
sns.set_context("talk")
Seaborn palettes
According to seaborn's documentation (https://seaborn.pydata.org/generated/seaborn.color_palette.html#seaborn.color_palette), there is a ready-made color palette called colorblind. There are not a lot of details on it, so I thought I'd experiment with it and ask for feedback. When I tested it in 2015, I said: "Unfortunately, only 6 colors are available until it starts recycling itself. That is clearly not good if we want to see patterns in numbers that range from 0 to 9 on each digit."
In 2020, the latest version of seaborn can now do 10 colors without repeating:
sns.set_palette(sns.color_palette("colorblind", 10))
sns.palplot(sns.color_palette())
seaborn colorblind palette, 10 discrete colors
One issue with this is the yellow (9th color). It is too bright for me in comparison to the others and sticks out, which will attract undue attention, even with a calibrated display (see color management). Another issue is that the 5th and 6th colors are almost indistinguishable for someone with tritanopia.
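One way to put a number on "too bright" is relative luminance. The sketch below is not part of the original notebook. It assumes colors are RGB triples with components between 0 and 1 (the format seaborn palettes use) and applies the standard sRGB linearization with Rec. 709 weights. The helper names and the three-color demo palette are made up for illustration.

```python
def srgb_to_linear(c):
    """Undo the sRGB transfer curve for one channel in the range 0..1."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance (0 = black, 1 = white) of an sRGB triple."""
    r, g, b = (srgb_to_linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def brightest(palette):
    """Return (index, luminance) of the brightest color in a palette."""
    lums = [relative_luminance(rgb) for rgb in palette]
    idx = max(range(len(lums)), key=lums.__getitem__)
    return idx, lums[idx]


# Hypothetical three-color palette, only to exercise the helpers:
# dark blue, pure yellow, mid gray.
demo = [(0.0, 0.0, 0.5), (1.0, 1.0, 0.0), (0.5, 0.5, 0.5)]
print(brightest(demo))
```

In the notebook, calling `brightest(sns.color_palette("colorblind", 10))` should show which entry stands out from the rest. Luminance is only a rough proxy for perceived brightness, so treat the result as a hint rather than a verdict.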
Another interesting choice is cubehelix. It works in color and gray scale. Plus, it also retains some ranking from dark to light, allowing for applications with ordinals too. How it works, roughly: you sample a helix inside a cubic colorspace, at set intervals on the 3D curve. You can learn more about it here: http://www.mrao.cam.ac.uk/~dag/CUBEHELIX/
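To make the helix idea concrete, here is a minimal pure-Python sketch of the cubehelix formula as published by Dave Green. It is an illustration, not seaborn's implementation. seaborn samples matplotlib's colormap, so the exact colors may differ slightly. The function names, the `clip` switch and the even sampling of interior points are assumptions made for this example.

```python
import math


def cubehelix_rgb(lam, start=0.5, rotations=-1.5, hue=1.0, gamma=1.0, clip=True):
    """One cubehelix color for a brightness fraction lam in 0..1."""
    lg = lam ** gamma
    # Angle around the gray diagonal of the RGB cube
    phi = 2.0 * math.pi * (start / 3.0 + 1.0 + rotations * lam)
    # Distance from the gray diagonal (zero at pure black and pure white)
    amp = hue * lg * (1.0 - lg) / 2.0
    cos_p, sin_p = math.cos(phi), math.sin(phi)
    rgb = (
        lg + amp * (-0.14861 * cos_p + 1.78277 * sin_p),
        lg + amp * (-0.29227 * cos_p - 0.90649 * sin_p),
        lg + amp * (1.97294 * cos_p),
    )
    if clip:
        rgb = tuple(min(1.0, max(0.0, c)) for c in rgb)
    return rgb


def cubehelix_palette(n, **kwargs):
    """n colors sampled at evenly spaced interior points of the helix."""
    return [cubehelix_rgb((i + 1) / (n + 1), **kwargs) for i in range(n)]


def perceived_intensity(rgb):
    """Weighted sum that the helix keeps equal to lam ** gamma before clipping."""
    r, g, b = rgb
    return 0.30 * r + 0.59 * g + 0.11 * b


palette = cubehelix_palette(10)
print([round(perceived_intensity(c), 3) for c in palette])
```

The coefficients are chosen so that the color terms cancel in the weighted sum, which is why the palette degrades gracefully to an evenly stepped gray scale when printed in black and white.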
First things first: let's set cubehelix as the default palette, with 10 discrete values, and display it.
In [2]:
sns.set_palette(sns.color_palette("cubehelix", 10))
sns.palplot(sns.color_palette())
It works quite well, but, as a reader had commented in the original article, it is hard to distinguish the 4th and 5th colors if you have protanopia.
Here is cubehelix as seen by someone with deuteranopia (simulated - this is a built-in feature of visualizations in visu.ai):
Overall, it is difficult to have more than 9 perfectly distinguishable colors to cover 100% of the population, but cubehelix is close to it. And again, because it works in gray scale too, this is also extremely useful for print work, where there is no guarantee that the output will be on a color printer.
Visualizing the digits of pi is not new: John Venn did it all the way back in 1866 in his book "The Logic of Chance". As he explains, he simply discarded the 8 and 9 digits. Since back then there were no computers, he picked his numbers from a book (by W. Shanks) which had 707 digits of pi, leaving him with 568 digits between 0 and 7. He mapped 0 to 7 to directions (10 directions might have felt a bit odd, at 36 degrees, versus nice 45 degree lines).
An infinite source of entertainment
Or at the very least, an infinite source of digits: the number π (pi)
So we will be plotting a large grid, with each square representing a digit of pi
and filled with the color found at that digit's index in the seaborn
palette.
Let's grab a pi digit generator. There is one here that's been
around since the days of Python 2.5:
https://www.daniweb.com/programming/software-development/code/249177/pi-generator-update
In [3]:
def pi_generate():
    """
    generator to approximate pi
    returns a single digit of pi each time iterated
    """
    q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < m * t:
            yield m
            q, r, t, k, m, x = 10*q, 10*(r-m*t), t, k, (10*(3*q+r))//t - 10*m, x
        else:
            q, r, t, k, m, x = q*k, (2*q+r)*x, t*x, k+1, (q*(7*k+2)+r*x)//(t*x), x+2
The 10 colors of Pi
We are now ready to do our actual visualization of pi.
For legal size paper, 55 x 97 is a good size (plus the ratio is roughly the square root of pi).
For poster, I use 154 x 204 = 31416 digits ☺️
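The two size claims are quick to check. A small sketch, using only the numbers quoted above (the variable names are made up for the example):

```python
import math

# Grid sizes quoted in the text, as (width, height).
legal = (55, 97)
poster = (154, 204)

print(poster[0] * poster[1])   # digits on the poster: 31416
print(legal[0] * legal[1])     # digits on the legal-size grid
print(legal[1] / legal[0])     # height / width of the legal-size grid
print(math.sqrt(math.pi))      # the value that ratio is compared against
```

The ratio and the square root of pi agree to about two significant figures, which is what "roughly" means here.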
It'll run for a while, even at the reduced size, so this is probably a good time to go and get your favorite drink ☕ ...
In [4]:
width = 55   # 154 for the poster
height = 97  # 204 for the poster
digit = pi_generate()
palette = sns.color_palette()  # look the palette up once instead of per square
fig = plt.figure(figsize=((width + 2) / 3., (height + 2) / 3.))
ax = fig.add_axes((0.05, 0.05, 0.9, 0.9),
                  aspect='equal', frameon=False,
                  xlim=(-0.05, width + 0.05),
                  ylim=(-0.05, height + 0.05))
# Hide ticks and tick labels on both axes
for axis in (ax.xaxis, ax.yaxis):
    axis.set_major_formatter(plt.NullFormatter())
    axis.set_major_locator(plt.NullLocator())
# Fill the grid row by row, starting from the top row
for j in range(height - 1, -1, -1):
    for i in range(width):
        pi_digit = next(digit)
        ax.add_patch(Rectangle((i, j),
                               width=1,
                               height=1,
                               ec=palette[pi_digit],
                               fc=palette[pi_digit],
                               )
                     )
        # Print the digit itself in the middle of its square
        ax.text(i + 0.5, j + 0.5,
                str(pi_digit), color='k',
                fontsize=10,
                ha='center', va='center')
ax.text(0, -1, "'THE 10 COLORS OF PI' by Francois Dion", fontsize=15)
Out[4]:
The end result shows no real pattern, although we are arbitrarily setting the line stride (width) to 55, so vertical patterns would be harder to detect. One horizontal pattern that does show is around the 14th row. On the right side, we see the number 9 six times in a row (don't be fooled by randomness). The moral of the story is that if we look hard enough we can find patterns even in random sequences or noise...
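The position of that run of 9s can be checked without looking at the picture. A minimal sketch: the generator is the same spigot as in the notebook, repeated under a different name so the cell runs on its own, and `locate_run` is a helper made up for this example. Index 0 is the leading 3, and rows and columns are counted from 0 with row 0 at the top, as in the plot.

```python
from itertools import islice


def pi_digits():
    """Same spigot as the notebook's pi_generate, repeated so this runs alone."""
    q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < m * t:
            yield m
            q, r, t, k, m, x = 10*q, 10*(r - m*t), t, k, (10*(3*q + r))//t - 10*m, x
        else:
            q, r, t, k, m, x = q*k, (2*q + r)*x, t*x, k + 1, (q*(7*k + 2) + r*x)//(t*x), x + 2


def locate_run(run, width, n_digits=1000):
    """Find the first occurrence of `run` and its (row, column) in the grid."""
    stream = "".join(str(d) for d in islice(pi_digits(), n_digits))
    start = stream.find(run)
    if start < 0:
        return None
    return start, start // width, start % width


print(locate_run("999999", width=55))
```

The run starts at index 762, which falls in row 13 counted from zero (the 14th row) at column 47 of 55, hence the right-hand side. With a different stride the same six digits could just as well straddle two rows and be much harder to notice.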
What if Pi had digits going from 0 to 7?
With cubehelix able to provide 9 distinct colors (and even more so, 8 colors) for 100% of the population, a comment on the original article suggested calculating the number Pi in an octal base instead of a decimal base. This would be quite an interesting approach.
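For the curious, the octal idea needs only a small change to the digit generator. This is a sketch under an assumption: that the spigot used in this notebook generalizes to another base by replacing the hard-coded 10 in the digit-producing step. The function name and the `base` parameter are not part of the original code.

```python
from itertools import islice


def pi_digits_base(base=10):
    """Stream the digits of pi in the given base (base >= 4, so that the
    integer part 3 is a single digit)."""
    q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < m * t:
            # The next digit is settled: emit it, then shift by one place
            yield m
            q, r, m = base * q, base * (r - m * t), (base * (3 * q + r)) // t - base * m
        else:
            # Not enough information yet: consume one more term of the series
            q, r, t, k, m, x = (q * k, (2 * q + r) * x, t * x, k + 1,
                                (q * (7 * k + 2) + r * x) // (t * x), x + 2)


octal = list(islice(pi_digits_base(8), 10))
print(octal)
```

Every value it yields is between 0 and 7, so an 8-color palette would be enough, and no digit has to be thrown away.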
Looking at Pi as having digits from 0 to 7 is actually something that has been done before, though not with an octal base: the approach was simply to take the decimal representation and keep only the digits less than 8:
John Venn, The Logic of Chance, 1866
8 directions on a compass
Although he doesn't specify the mapping, it is easy to infer from the
graph. The first digit after the decimal is 1, then 4 and we can see the
path as NE, then S, so:
Digit | Direction |
0 | N |
1 | NE |
2 | E |
3 | SE |
4 | S |
5 | SW |
6 | W |
7 | NW |
The random walk
He would then move by 1 unit in the direction of each digit / direction
mapping. NE, S, NE, SW, skip 9, E, so on and so forth.
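The walk is easy to reproduce. A minimal sketch, assuming the digit-to-direction table above and treating each move as one step on an integer grid (so diagonal moves are longer than straight ones, which may differ from how Venn drew them). `STEPS` and `venn_walk` are names made up for this illustration.

```python
# Digit-to-direction mapping inferred from the graph, as (dx, dy) grid steps.
STEPS = {
    0: (0, 1),    # N
    1: (1, 1),    # NE
    2: (1, 0),    # E
    3: (1, -1),   # SE
    4: (0, -1),   # S
    5: (-1, -1),  # SW
    6: (-1, 0),   # W
    7: (-1, 1),   # NW
}


def venn_walk(digits):
    """Return the list of (x, y) points visited, starting at the origin.

    Digits 8 and 9 have no direction and are skipped.
    """
    x, y = 0, 0
    path = [(x, y)]
    for d in digits:
        if d not in STEPS:
            continue
        dx, dy = STEPS[d]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


# First decimals of pi after the leading 3: 1, 4, 1, 5, 9, 2
print(venn_walk([1, 4, 1, 5, 9, 2]))
# [(0, 0), (1, 1), (1, 0), (2, 1), (1, 0), (2, 0)]
```

Feeding it a longer digit stream and passing the x and y lists to `plt.plot` would give a picture comparable to Venn's.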
His conclusion stated:
"The result seems to me to furnish a very fair graphical indication of randomness".
Besides these two visual ways of assessing the randomness of the digits, we could also have looked at the distribution of each digit. If we generate 55 x 97 digits, or 5335 digits total, each of 0 to 9 should appear about 533.5 times if the distribution is uniform.
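As a rough numerical counterpart, the digits can simply be counted and compared with the uniform expectation. The sketch below is not from the original article: the helper names are made up, the Pearson chi-square statistic is one common choice among several, and the sample is limited to the first 50 digits of pi to keep it short.

```python
from collections import Counter


def digit_counts(digits):
    """Count how often each digit 0-9 appears; missing digits count as 0."""
    counts = Counter(digits)
    return [counts.get(d, 0) for d in range(10)]


def chi_square_uniform(counts):
    """Pearson chi-square statistic against a uniform expectation."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


# First 50 digits of pi (leading 3 included), only to keep the demo short.
sample = [int(ch) for ch in "31415926535897932384626433832795028841971693993751"]
counts = digit_counts(sample)
print(counts, round(chi_square_uniform(counts), 2))
```

For the full grid, the same functions can be fed `itertools.islice(pi_generate(), 5335)`. The closer the statistic is to zero, the closer the counts are to 533.5 each.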
In a future post I'll show a few ways to plot this, including using Dion Research's Hotelling visualization package.
Learn more
I encourage you to check out "The Hitchhikers Guide to the Open Source Data Science Galaxy", particularly page 12, covering some aspects of colorspace, and tools to handle color management, ambience and simulation of color impairment (such as color oracle).
Part VI of my "ex-libris" of a data scientist covers books related to colors: https://blog.dionresearch.com/2020/04/ex-libris-of-data-scientist-part-vi.html
Also, check out this twitter thread for several papers related to color:
Another week, more classic papers that are good to know when doing data science. This week I'll talk about color theory, briefly (more than 📕📗📘). This is important for #dataviz and #communications aspects of #datascience.
— Francois Dion (@f_dion) March 25, 2019
Francois Dion
Chief Data Scientist
@f_dion
About Dion Research LLC: Makers of visu.ai. We are a boutique Data Science consultancy, established 2011. As we do end-to-end Data Science, we can help you solve business problems every step of the way. Get in touch for more information.