# Yet Another Game Of Life

The famous Game of Life, devised by mathematician John Conway about four decades ago, is the best-known example of a cellular automaton. It has probably been coded up in every possible computer language about a million times by now, as evidenced by this plethora of examples. I was introduced to the rules of the game already 15 years ago during my first year in college, and became immediately fascinated by the complexity and beauty of some of the emerging patterns. At that time I implemented it in C, and soon forgot about it. This is already a very good reason to revisit the game, its rules, and to show animations produced with matplotlib in python. Let’s go!

### Rules of Conway’s game of life

As per Wikipedia, the universe of the Game of Life is an infinite two-dimensional grid of square cells, each of which is in one of two possible states, alive or dead. Every cell interacts with its eight neighbors, which are the cells that are horizontally, vertically, or diagonally adjacent. At each step in time, the following transitions occur:

1. Any live cell with fewer than two live neighbors dies, as if caused by under-population.
2. Any live cell with two or three live neighbors lives on to the next generation.
3. Any live cell with more than three live neighbors dies, as if by overcrowding.
4. Any dead cell with exactly three live neighbors becomes a live cell, as if by reproduction.

### Yet another python implementation of Life

Our python implementation will use a two-dimensional numpy array to store the grid representing the universe, with values 1 for live and 0 for dead cells. We will code a init function for the initialization of the grid and a evolve routine for the evolution of the universe. A very simple way of initializing the grid to random values, allowing for variable grid dimensions, is as follows:

def init_universe(rows, cols):
grid = np.zeros([rows, cols])
for i in range(rows):
for j in range(cols):
grid[i][j] = round(random.random())
return grid


An example of a random universe created with the init_universe(rows, cols) function with 600 cells distributed in 20 rows and 30 columns can be seen in the figure on the right. The call to generate the figure, with black cells representing live (or 1) states, is as follows:

grid = init_universe(20,30)
ax = plt.axes()
ax.matshow(grid,cmap=cm.binary)
ax.set_axis_off()


Now, for the evolution logic, let us code a function that takes a universe as input, together with the parameters that regulate its evolution, and outputs the new universe after one iteration. The classical rules of the game of life set the parameters for overcrowding, under-population and reproduction as 3, 2, 3, respectively. In our implementation, we create a padding around the original universe, which allows us to define the neighbors in an easy way without having to worry whether a particular cell is at the border or not. At every position i,j we compute the sum of all cells in positions $[i-1, i, i+1] \times [j-1, j, j+1]$ and then we subtract the center point at i,j. Then we apply the evolution logic: cells die when underpopulated or overcrowded, and new cells are born when the reproduction condition (3 alive neighbors) is fulfilled:

def evolve(grid, pars):
overcrowd, underpop, reproduction = pars
rows, cols = grid.shape
newgrid = np.zeros([rows, cols])
neighbors = np.zeros([rows,cols])
# Compute neighbours and newgrid
for i in range(rows):
for j in range(cols):
neighbors[i][j] += sum([padboard[a][b] for a in [i-1, i, i+1] \
for b in [j-1, j, j+1]])
# Evolution logic
newgrid[i][j] = grid[i][j]
if grid[i][j] and \
(neighbors[i][j] > overcrowd or neighbors[i][j] < underpop):
newgrid[i][j] = 0
elif not grid[i][j] and neighbors[i][j] == reproduction:
newgrid[i][j] = 1
return newgrid


Note that in the above code we make use of a wonderful property of arrays in python, namely that the last element of an array arr can be referenced either as arr[len(arr)-1] or as arr[-1]. Thus, we create a padboard with 2 columns and 2 rows more than the dimensions of the grid. If n is the number of rows of the grid, the padboard has n+2 rows, which range from 0 to n+1, or, equivalently, from -1 to n!

### Visualization of Life and creation of mp4 videos with matplotlib

For the visualization of the evolution of our random universe we could create a series of png plots and stitch them together to produce an animated gif. However, matplotlib also offers the possibility of generating animations and saving them directly in mp4 format. The code that follows is based on this very useful tutorial, which contains instructions to embed matplotlib animations directly in the ipython notebook.

pars = 3, 2, 3
rows, cols = 20, 20
fig = plt.figure()
ax = plt.axes()
im = ax.matshow(init_universe(rows,cols),cmap=cm.binary)
ax.set_axis_off()

def init():
im.set_data(init_universe(rows, cols))

def animate(i):
a = im.get_array()
a = evolve(a, pars)
im.set_array(a)
return [im]


In the code above, we have set the parameters for the evolution as described by the original logic of the game, and we initialize a matplotlib figure using the matshow directive. We also need the functions init and animate; the latter updates the content of the plot with evolved iterations of the universe. The matplotlib call to produce an animation and save it in mp4 format is then simply:

anim = animation.FuncAnimation(fig, animate, init_func=init, frames=100, blit=True)

anim.save('animation_random.mp4', fps=10) # fps = FramesPerSecond


A random initialization gives rise to the following animation, which ends in a configuration with 3 stable patterns (block, blinker and a diamond-shaped structure that settles to a blinker in 8 steps) after approximately 50 iterations.

While a random initialization often gives rise to interesting universes, there is a vast body of research devoted to classifying particular configurations that are known to evolve in a specific fashion (oscillators, stable figures, moving patterns…). A quick google search illustrates this point and leads to many resources for the interested reader. For starters, let us code up the initialization function of a “pulsar”, a type of oscillator with a 3-iteration period.

def init_universe_pulsar():
grid = zeros([15, 15])
line = zeros(15)
line[3:6] = 1
line[9:12] = 1
for ind in [1,6,8,13]:
grid[ind] = line
grid[:,ind] = line
return grid


To generate and save this universe, we need to modify the function init used to produce the matplotlib animation and replace the call to init_universe(rows, cols) with init_universe_pulsar(). The resulting evolution can be seen in the following video:

An interesting kind of universes are those that resemble spacecrafts. There are many of them, as a visit to this page shows. Other configurations resemble guns that emit gliders forever. By far the most famous one is the Cosper glider gun, which can be generated using the following initialization function:

def init_universe_glider_gun():
glider_gun = 38*'0' + 25*'0'+'1'+12*'0' + 23*'0'+'101'+12*'0' +\
13*'0'+'11'+6*'0'+'11'+12*'0'+'11'+'0' +\
12*'0'+'1'+3*'0'+'1'+4*'0'+'11'+12*'0'+'11'+'0' +\
'0'+'11'+8*'0'+'1'+5*'0'+'100011'+15*'0' +\
'0'+'11'+8*'0'+'1'+'000'+'1011'+4*'0'+'101'+12*'0' +\
11*'0'+'1000001'+7*'0'+'1'+12*'0' +\
12*'0'+'10001'+21*'0' + 13*'0'+'11'+23*'0' + 38*'0' +\
19*38*'0'
grid = np.array([float(g) for g in glider_gun]).reshape(30,38)
return grid


Once started, this glider gun evolves emitting gliders indefinitely, which move across the grid at -45 degrees and exit the universe bounding box through the bottom right corner. Bill Gosper discovered this first glider gun, which is so far the smallest one ever found, in 1970 and got 50 dollars from Conway for that. The discovery of the glider gun eventually led to the proof that Conway’s Game of Life could function as a Turing machine. A video of the Gosper glider gun in action can be seen below.

Here is a very nice visualization of yet another type of configuration in Life, the so-called “puffers”, patterns that move like a spaceship but leave debris behind as they evolve.

### Table-top data experiment take-away message

Conway’s game of life is a zero-player game, with evolution completely determined by its initial state, consisting on live and dead cells on a two-dimensional grid. The state of each cell varies with each iteration according to the number of populated neighbors in the adjacent cells. The game, devised in 1970, opened up a whole new area of mathematical research, the field of cellular automata, and belongs to a growing class of what are called “simulation games”. Implementing the evolution algorithm behind the game in any programming language is a classical exercise in many CS schools, is always a lot of fun, and allows to explore the huge variety of configurations that give rise to strangely addictive evolution patterns.

# Finding the K in K-Means Clustering

A couple of weeks ago, here at The Data Science Lab we showed how Lloyd’s algorithm can be used to cluster points using k-means with a simple python implementation. We also produced interesting visualizations of the Voronoi tessellation induced by the clustering. At the end of the post we hinted at some of the shortcomings of this clustering procedure. The basic k-means is an extremely simple and efficient algorithm. However, it assumes prior knowledge of the data in order to choose the appropriate K. Other disadvantages are the sensitivity of the final clusters to the selection of the initial centroids and the fact that the algorithm can produce empty clusters. In today’s post, and by popular request, we are going to have a look at the first question, namely how to find the appropriate K to use in the k-means clustering procedure.

### Meaning and purpose of clustering, and the elbow method

Clustering consist of grouping objects in sets, such that objects within a cluster are as similar as possible, whereas objects from different clusters are as dissimilar as possible. Thus, the optimal clustering is somehow subjective and dependent on the characteristic used for determining similarities, as well as on the level of detail required from the partitions. For the purpose of our clustering experiment we use clusters derived from Gaussian distributions, i.e. globular in nature, and look only at the usual definition of Euclidean distance between points in a two-dimensional space to determine intra- and inter-cluster similarity.

The following measure represents the sum of intra-cluster distances between points in a given cluster $C_k$ containing $n_k$ points:

$\displaystyle D_k = \sum_{\mathrm{x}_i \in C_k} \sum_{\mathrm{x}_j \in C_k} ||\mathrm{x}_i - \mathrm{x}_j ||^2 = 2 n_k \sum_{\mathrm{x}_i \in C_k} ||\mathrm{x}_i - \mu_k ||^2$.

Adding the normalized intra-cluster sums of squares gives a measure of the compactness of our clustering:

$\displaystyle W_k = \sum_{k=1}^K \frac{1}{2n_k} D_k$.

This variance quantity $W_k$ is the basis of a naive procedure to determine the optimal number of clusters: the elbow method.

If you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters are chosen at this point, hence the “elbow criterion”.

But as Wikipedia promptly explains, this “elbow” cannot always be unambiguously identified. In this post we will show a more sophisticated method that provides a statistical procedure to formalize the “elbow” heuristic.

### The gap statistic

The gap statistic was developed by Stanford researchers Tibshirani, Walther and Hastie in their 2001 paper. The idea behind their approach was to find a way to standardize the comparison of $\log W_k$ with a null reference distribution of the data, i.e. a distribution with no obvious clustering. Their estimate for the optimal number of clusters $K$ is the value for which $\log W_k$ falls the farthest below this reference curve. This information is contained in the following formula for the gap statistic:

$\displaystyle \mathrm{Gap}_n(k) = E_n^*\{\log W_k\} - \log W_k$.

The reference datasets are in our case generated by sampling uniformly from the original dataset’s bounding box (see green box in the upper right plot of the figures below). To obtain the estimate $E_n^*\{\log W_k\}$ we compute the average of $B$ copies $\log W^*_k$ for $B=10$, each of which is generated with a Monte Carlo sample from the reference distribution. Those $\log W^*_k$ from the $B$ Monte Carlo replicates exhibit a standard deviation $\mathrm{sd}(k)$ which, accounting for the simulation error, is turned into the quantity

$\displaystyle s_k = \sqrt{1+1/B}\,\mathrm{sd}(k)$.

Finally, the optimal number of clusters $K$ is the smallest $k$ such that $\mathrm{Gap}(k) \geq \mathrm{Gap}(k+1) - s_{k+1}$.

### A Python implementation of the algorithm

The computation of the gap statistic involves the following steps (see original paper):

• Cluster the observed data, varying the number of clusters from $k = 1, ..., k_{\mathrm{max}}$, and compute the corresponding $W_k$.
• Generate $B$ reference data sets and cluster each of them with varying number of clusters $k = 1, ..., k_{\mathrm{max}}$. Compute the estimated gap statistic $\mathrm{Gap}(k) = (1/B) \sum_{b=1}^B \log W^*_{kb} - \log W_k$.
• With $\bar{w} = (1/B) \sum_b \log W^*_{kb}$, compute the standard deviation $\mathrm{sd}(k) = [(1/B) \sum_b (\log W^*_{kb} - \bar{w})^2]^{1/2}$ and define $\displaystyle s_k = \sqrt{1+1/B}\,\mathrm{sd}(k)$.
• Choose the number of clusters as the smallest $k$ such that $\mathrm{Gap}(k) \geq \mathrm{Gap}(k+1) - s_{k+1}$.

Our python implementation makes use of the find_centers(X, K) function defined in this post. The quantity $W_k$ is computed as follows:

def Wk(mu, clusters):
K = len(mu)
return sum([np.linalg.norm(mu[i]-c)**2/(2*len(c)) \
for i in range(K) for c in clusters[i]])


The gap statistic is implemented in the following code snapshot. Note that we use $B=10$ for the reference datasets and we span values of $K$ from 1 to 9.

def bounding_box(X):
xmin, xmax = min(X,key=lambda a:a[0])[0], max(X,key=lambda a:a[0])[0]
ymin, ymax = min(X,key=lambda a:a[1])[1], max(X,key=lambda a:a[1])[1]
return (xmin,xmax), (ymin,ymax)

def gap_statistic(X):
(xmin,xmax), (ymin,ymax) = bounding_box(X)
# Dispersion for real distribution
ks = range(1,10)
Wks = zeros(len(ks))
Wkbs = zeros(len(ks))
sk = zeros(len(ks))
for indk, k in enumerate(ks):
mu, clusters = find_centers(X,k)
Wks[indk] = np.log(Wk(mu, clusters))
# Create B reference datasets
B = 10
BWkbs = zeros(B)
for i in range(B):
Xb = []
for n in range(len(X)):
Xb.append([random.uniform(xmin,xmax),
random.uniform(ymin,ymax)])
Xb = np.array(Xb)
mu, clusters = find_centers(Xb,k)
BWkbs[i] = np.log(Wk(mu, clusters))
Wkbs[indk] = sum(BWkbs)/B
sk[indk] = np.sqrt(sum((BWkbs-Wkbs[indk])**2)/B)
sk = sk*np.sqrt(1+1/B)
return(ks, Wks, Wkbs, sk)


### Finding the K

We shall now apply our algorithm to diverse distributions and see how it performs. Using the init_board_gauss(N, k) function defined in our previous post, we produce an ensemble of 200 data points normally distributed around 3 centers and run the gap statistic on them.

X = init_board_gauss(200,3)
ks, logWks, logWkbs, sk = gap_statistic(X)


The following plot gives an idea of what is happening:

The upper left plot shows the target distribution with 3 clusters. On the right is its bounding box and one Monte Carlo sample drawn from a uniform reference distribution within that rectangle. In the middle left we see the plot of $W_k$ that is used to determine $K$ with the elbow method. Indeed a knee-like feature is observed at $K=3$, however the gap statistic is a better way of formalizing this phenomenon. On the right is the comparison of $\log W_k$ for the original and averaged reference distributions. Finally, the bottom plots show the gap quantity on the left, with a clear peak at the correct $K=3$ and the criteria for choosing it on the right. The correct $K$ is the smallest for which the quantity plotted in blue bars becomes positive. The optimal number is correctly guessed by the algorithm as $K=3$.

Let us now have a look at another example with 400 points around 5 clusters:

In this case, the elbow method would not have been conclusive, however the gap statistic correctly shows a peak in the gap at $K=5$ and the bar plot changes sign at the same correct value.

Similarly, we can study what happens when the data points are clustered around a single centroid:

It is clear in the above figures that the original and the reference distributions in the middle right plot follow the same decay law, so that no abrupt fall-off of the blue curve with respect to the red one is observed at any $K$. The bar plot shows positive values for the entire $K$ range. We conclude that $K=1$ is the correct clustering.

Finally, let us have a look at a uniform, non-clustered distribution of 200 points, generated with the init_board(N) function defined in our previous post:

In this case, the algorithm also guesses $K=1$ correctly, and it is clear from the middle right plot that both the original and the reference distributions follow exactly the same decay law, since they are essentially different samples from the same uniform distribution on [-1,1] x [-1,1]. The gap curve on the bottom left oscillates between local maxima and minima, indicating certain structures within the original distribution originated by statistical fluctuations.

### Table-top data experiment take-away message

The estimation of the optimal number of clusters within a set of data points is a very important problem, as most clustering algorithms need that parameter as input in order to group the data. Many methods have been proposed to find the proper $K$, among which the “elbow” method offers a very clear and naive solution based on intra-cluster variance. The gap statistic, proposed by Tobshirani et al. formalizes this approach and offers an easy-to-implement algorithm that successfully finds the correct $K$ in the case of globular, Gaussian-distributed, mildly disjoint data distributions.

Update: For a proper initialization of the centroids at the start of the k-means algorithm, we implement the improved k-means++ seeding procedure.

Update: For a comparison of this approach with an alternative method for finding the K in k-means clustering, read this article.

# Beautiful Plots With Pandas and Matplotlib

Data visualization plays a crucial role in the communication of results from data analyses, and it should always help transmit insights in an honest and clear way. Recently, the highly recommendable blog Flowing Data posted a review of data visualization highlights during 2013, and at The Data Science Lab we felt like doing a bit of pretty plotting as well.

For Python lovers, matplotlib is the library of choice when it comes to plotting. Quite conveniently, the data analysis library pandas comes equipped with useful wrappers around several matplotlib plotting routines, allowing for quick and handy plotting of data frames. Nice examples of plotting with pandas can be seen for instance in this ipython notebook. Still, for customized plots or not so typical visualizations, the panda wrappers need a bit of tweaking and playing with matplotlib’s inside machinery. If one is willing to devote a bit of time to google-ing and experimenting, very beautiful plots can emerge.

### Visualizing demographic data

For this pre-Christmas data visualization table-top experiment we are going to use demographic data from countries in the European Union obtained from Wolfram|Alpha. Our data set contains information on population, extension and life expectancy in 24 European countries. We create a pandas data frame from three series that we simply construct from lists, setting the countries as index for each series, and consequently for the data frame.

import pandas as pd
import matplotlib as mpl
from matplotlib.colors import LinearSegmentedColormap
from matplotlib.lines import Line2D

countries = ['France','Spain','Sweden','Germany','Finland','Poland','Italy',
'United Kingdom','Romania','Greece','Bulgaria','Hungary',
'Portugal','Austria','Czech Republic','Ireland','Lithuania','Latvia',
'Croatia','Slovakia','Estonia','Denmark','Netherlands','Belgium']
extensions = [547030,504782,450295,357022,338145,312685,301340,243610,238391,
131940,110879,93028,92090,83871,78867,70273,65300,64589,56594,
49035,45228,43094,41543,30528]
populations = [63.8,47,9.55,81.8,5.42,38.3,61.1,63.2,21.3,11.4,7.35,
9.93,10.7,8.44,10.6,4.63,3.28,2.23,4.38,5.49,1.34,5.61,
16.8,10.8]
life_expectancies = [81.8,82.1,81.8,80.7,80.5,76.4,82.4,80.5,73.8,80.8,73.5,
74.6,79.9,81.1,77.7,80.7,72.1,72.2,77,75.4,74.4,79.4,81,80.5]
data = {'extension' : pd.Series(extensions, index=countries),
'population' : pd.Series(populations, index=countries),
'life expectancy' : pd.Series(life_expectancies, index=countries)}

df = pd.DataFrame(data)
df = df.sort('life expectancy')


Now, thanks to the pandas plotting machinery, it is extremely straightforward to show the contents of this data frame by calling the pd.plot function. The code below generates a figure with three subplots displayed vertically, each of which shows a bar plot for a particular column of the data frame. The plots are automatically labelled with the column names of the data frame, and the whole procedure takes literally no time.

fig, axes = plt.subplots(nrows=3, ncols=1)
for i, c in enumerate(df.columns):
df[c].plot(kind='bar', ax=axes[i], figsize=(12, 10), title=c)
plt.savefig('EU1.png', bbox_inches='tight')


The output figure looks like this:

### Customization with matplotlib directives

While this is an acceptable plot for the first steps of data exploration, the figure is not really publication-ready. It also looks very much “academic” and lacks that subtle flair that infographics in mainstream media have. Over the next paragraphs we will turn this plot into a much more beautiful object by playing around with the options that matplotlib supplies.

Let us first start by creating a figure and an axis object that will contain our subfigure:

# Create a figure of given size
fig = plt.figure(figsize=(16,12))
# Set title
ttl = 'Population, size and age expectancy in the European Union'


Colors are very important for data visualizations. By default, the matplotlib color palette offers solid hues, which can be softened by applying transparencies. Similarly, the default colorbars can be customized to match our taste (see below how one can define a custom-made color map with a gradient that softly changes from orange to gray-blue hues).

# Set color transparency (0: transparent; 1: solid)
a = 0.7
# Create a colormap
customcmap = [(x/24.0,  x/48.0, 0.05) for x in range(len(df))]


The main plotting instruction in our figure uses the pandas plot wrapper. In the initialization options, we specify the type of plot (horizontal bar), the transparency, the color of the bars following the above-defined custom color map, the x-axis limits and the figure title. We also set the color of the bar borders to white for a cleaner look.

# Plot the 'population' column as horizontal bar plot
df['population'].plot(kind='barh', ax=ax, alpha=a, legend=False, color=customcmap,
edgecolor='w', xlim=(0,max(df['population'])), title=ttl)


After this simple pandas plot directive, the figure already looks very promising. Note that, because we sorted the data frame by life expectancy and applied a gradient color map, the color of the different bars in itself carries information. We will explicitly label that information below when constructing a color bar. For now we want to remove the grid, frame and axes lines from our plot, as well as customize its title and x,y axes labels.

# Remove grid lines (dotted lines inside plot)
ax.grid(False)
# Remove plot frame
ax.set_frame_on(False)
# Pandas trick: remove weird dotted line on axis
ax.lines[0].set_visible(False)

# Customize title, set position, allow space on top of plot for title
ax.set_title(ax.get_title(), fontsize=26, alpha=a, ha='left')
ax.title.set_position((0,1.08))

# Set x axis label on top of plot, set label text
ax.xaxis.set_label_position('top')
xlab = 'Population (in millions)'
ax.set_xlabel(xlab, fontsize=20, alpha=a, ha='left')
ax.xaxis.set_label_coords(0, 1.04)

# Position x tick labels on top
ax.xaxis.tick_top()
# Remove tick lines in x and y axes
ax.yaxis.set_ticks_position('none')
ax.xaxis.set_ticks_position('none')

# Customize x tick lables
xticks = [5,10,20,50,80]
ax.xaxis.set_ticks(xticks)
ax.set_xticklabels(xticks, fontsize=16, alpha=a)

# Customize y tick labels
yticks = [item.get_text() for item in ax.get_yticklabels()]
ax.set_yticklabels(yticks, fontsize=16, alpha=a)


So far, the lenghts of our horizontal bars display the population (in millions) of the EU countries. All bars have the same height (which is set to 50% of the total space between bars by default by pandas). An interesting idea is to use the height of the bars to display further data. If we could made the bar height dependent on, say, the countries’ extension, we would be adding an supplementary piece of information to the plot. This is possible in matplotlib by accessing the elements that contain the bars and assigning them a specific height in a for loop. Each bar is an element of the class Rectangle, and all the corresponding class methods can be applied to it. For assigning a given height according to each country’s extension, we code a simple linear interpolation and create a lambda function to apply it.

# Set bar height dependent on country extension
# Set min and max bar thickness (from 0 to 1)
hmin, hmax = 0.3, 0.9
xmin, xmax = min(df['extension']), max(df['extension'])
# Function that interpolates linearly between hmin and hmax
f = lambda x: hmin + (hmax-hmin)*(x-xmin)/(xmax-xmin)
# Make array of heights
hs = [f(x) for x in df['extension']]

# Iterate over bars
for container in ax.containers:
# Each bar has a Rectangle element as child
for i,child in enumerate(container.get_children()):
# Reset the lower left point of each bar so that bar is centered
child.set_y(child.get_y()- 0.125 + 0.5-hs[i]/2)
# Attribute height to each Recatangle according to country's size
plt.setp(child, height=hs[i])


Having added this “dimension” to the plot, we need a way of labelling the information so that the countries’ extension is understandable. A legend would be the ideal solution, but since our plotting directive was set to display the column ['population'], we can not use the default. We can construct a “fake” legend though, and custom-made its handles to roughly match the height of the bars. We position the legend in the lower right part of our plot.

# Legend
# Create fake labels for legend
l1 = Line2D([], [], linewidth=6, color='k', alpha=a)
l2 = Line2D([], [], linewidth=12, color='k', alpha=a)
l3 = Line2D([], [], linewidth=22, color='k', alpha=a)

# Set three legend labels to be min, mean and max of countries extensions
# (rounded up to 10k km2)
rnd = 10000
labels = [str(int(round(l/rnd)*rnd)) for l in min(df['extension']),
mean(df['extension']), max(df['extension'])]

# Position legend in lower right part
# Set ncol=3 for horizontally expanding legend
leg = ax.legend([l1, l2, l3], labels, ncol=3, frameon=False, fontsize=16,
bbox_to_anchor=[1.1, 0.11], handlelength=2,

# Customize legend title
# Set position to increase space between legend and labels
plt.setp(leg.get_title(), fontsize=20, alpha=a)
leg.get_title().set_position((0, 10))
# Customize transparency for legend labels
[plt.setp(label, alpha=a) for label in leg.get_texts()]


Finally, there is another piece of information in the plot that needs to be labelled, and that is the color map indicating the average life expectancy in the EU countries. Since we used a custom-made color map, the regular call to plt.colorbar() would not work. We need to create a LinearSegmentedColormap instead and “trick” matplotlib to display it as a colorbar. Then we can use the usual customization methods from colorbar to set fonts, transparency, position and size of the diverse elements in the color legend.

# Create a fake colorbar
ctb = LinearSegmentedColormap.from_list('custombar', customcmap, N=2048)
# Trick from http://stackoverflow.com/questions/8342549/
sm = plt.cm.ScalarMappable(cmap=ctb, norm=plt.normalize(vmin=72, vmax=84))
# Fake up the array of the scalar mappable
sm._A = []

# Set colorbar, aspect ratio
cbar = plt.colorbar(sm, alpha=0.05, aspect=16, shrink=0.4)
cbar.solids.set_edgecolor("face")
# Remove colorbar container frame
cbar.outline.set_visible(False)
# Fontsize for colorbar ticklabels
cbar.ax.tick_params(labelsize=16)
# Customize colorbar tick labels
mytks = range(72,86,2)
cbar.set_ticks(mytks)
cbar.ax.set_yticklabels([str(a) for a in mytks], alpha=a)

# Colorbar label, customize fontsize and distance to colorbar
cbar.set_label('Age expectancy (in years)', alpha=a,
# Remove color bar tick lines, while keeping the tick labels
cbarytks = plt.getp(cbar.ax.axes, 'yticklines')
plt.setp(cbarytks, visible=False)


The final and most rewarding step consists of saving the figure in our preferred format.

# Save figure in png with tight bounding box
plt.savefig('EU.png', bbox_inches='tight', dpi=300)


The final result looks this beautiful:

### Table-top data experiment take-away message

When producing a plot based on multidimensional data, it is a good idea to resort to shapes and colors that visually guide us through the variables on display. Matplotlib offers a high level of customization for all details of a plot, albeit the truth is that finding exactly which knob to tweak might be at times bewildering. Beautiful plots can be created by experimenting with various settings, among which hues, transparencies and simple layouts are the focal points. The results are publication-ready figures with open-source software that can be easily replicated by means of structured python code.