December 21, 2013

Beautiful Plots With Pandas and Matplotlib

[Click here to see the final plot described in this article.]

Data visualization plays a crucial role in communicating the results of data analyses, and it should always help transmit insights in an honest and clear way. Recently, the highly recommended blog FlowingData posted a review of data visualization highlights during 2013, and at The Data Science Lab we felt like doing a bit of pretty plotting as well.

For Python lovers, matplotlib is the library of choice when it comes to plotting. Quite conveniently, the data analysis library pandas comes equipped with useful wrappers around several matplotlib plotting routines, allowing for quick and handy plotting of data frames. Nice examples of plotting with pandas can be seen for instance in this IPython notebook. Still, for customized plots or less typical visualizations, the pandas wrappers need a bit of tweaking and some playing with matplotlib’s inner machinery. If one is willing to devote a bit of time to googling and experimenting, very beautiful plots can emerge.

Visualizing demographic data

For this pre-Christmas data visualization table-top experiment we are going to use demographic data from countries in the European Union obtained from Wolfram|Alpha. Our data set contains information on population, extension (land area) and life expectancy in 24 European countries. We create a pandas data frame from three series that we simply construct from lists, setting the countries as index for each series, and consequently for the data frame.

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
from matplotlib.lines import Line2D

countries = ['France','Spain','Sweden','Germany','Finland','Poland','Italy',
             'United Kingdom','Romania','Greece','Bulgaria','Hungary',
             'Portugal','Austria','Czech Republic','Ireland','Lithuania','Latvia',
             'Croatia','Slovakia','Estonia','Denmark','Netherlands','Belgium']
extensions = [547030,504782,450295,357022,338145,312685,301340,243610,238391,
              131940,110879,93028,92090,83871,78867,70273,65300,64589,56594,
              49035,45228,43094,41543,30528]
populations = [63.8,47,9.55,81.8,5.42,38.3,61.1,63.2,21.3,11.4,7.35,
               9.93,10.7,8.44,10.6,4.63,3.28,2.23,4.38,5.49,1.34,5.61,
               16.8,10.8]
life_expectancies = [81.8,82.1,81.8,80.7,80.5,76.4,82.4,80.5,73.8,80.8,73.5,
                    74.6,79.9,81.1,77.7,80.7,72.1,72.2,77,75.4,74.4,79.4,81,80.5]
data = {'extension' : pd.Series(extensions, index=countries), 
        'population' : pd.Series(populations, index=countries),
        'life expectancy' : pd.Series(life_expectancies, index=countries)}

df = pd.DataFrame(data)
df = df.sort_values('life expectancy')

Now, thanks to the pandas plotting machinery, it is extremely straightforward to show the contents of this data frame by calling the plot method of each of its columns. The code below generates a figure with three subplots displayed vertically, each of which shows a bar plot for a particular column of the data frame. The plots are automatically labelled with the column names of the data frame, and the whole procedure takes only a few lines of code.

fig, axes = plt.subplots(nrows=3, ncols=1)
for i, c in enumerate(df.columns):
    df[c].plot(kind='bar', ax=axes[i], figsize=(12, 10), title=c)
plt.savefig('EU1.png', bbox_inches='tight')

The output figure looks like this:

Customization with matplotlib directives

While this is an acceptable plot for the first steps of data exploration, the figure is not really publication-ready. It also looks very much “academic” and lacks that subtle flair that infographics in mainstream media have. Over the next paragraphs we will turn this plot into a much more beautiful object by playing around with the options that matplotlib supplies.

Let us first start by creating a figure and an axis object that will contain our subfigure:

# Create a figure of given size
fig = plt.figure(figsize=(16,12))
# Add a subplot
ax = fig.add_subplot(111)
# Set title
ttl = 'Population, size and life expectancy in the European Union'

Colors are very important for data visualizations. By default, the matplotlib color palette offers solid hues, which can be softened by applying transparencies. Similarly, the default colorbars can be customized to match our taste (see below how one can define a custom-made color map with a gradient that softly changes from orange to gray-blue hues).

# Set color transparency (0: transparent; 1: solid)
a = 0.7
# Create a colormap
customcmap = [(x/24.0,  x/48.0, 0.05) for x in range(len(df))]
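
To get a feeling for the gradient encoded in this list, the short sketch below rebuilds the same RGB tuples without pandas (assuming 24 rows, as in our data frame) and converts the two end points to hex codes. The helper to_hex is a hypothetical convenience written for this illustration only; it is not the matplotlib function of the same name.

```python
# Rebuild the list of RGB tuples for 24 bars (assumption: len(df) == 24)
n_bars = 24
customcmap = [(x / 24.0, x / 48.0, 0.05) for x in range(n_bars)]

def to_hex(rgb):
    """Hypothetical helper: convert an RGB tuple in [0, 1] to a hex string."""
    return '#%02x%02x%02x' % tuple(int(round(255 * c)) for c in rgb)

first_color = to_hex(customcmap[0])   # color assigned to the first row
last_color = to_hex(customcmap[-1])   # color assigned to the last row
print(first_color, last_color)
```

The red and green channels grow linearly with the row index while blue stays fixed at 0.05, so the list runs from a very dark blue for the first row to a saturated orange for the last one.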

The main plotting instruction in our figure uses the pandas plot wrapper. In the initialization options, we specify the type of plot (horizontal bar), the transparency, the color of the bars following the above-defined custom color map, the x-axis limits and the figure title. We also set the color of the bar borders to white for a cleaner look.

# Plot the 'population' column as horizontal bar plot
df['population'].plot(kind='barh', ax=ax, alpha=a, legend=False, color=customcmap,
                      edgecolor='w', xlim=(0,max(df['population'])), title=ttl)

After this simple pandas plot directive, the figure already looks very promising. Note that, because we sorted the data frame by life expectancy and applied a gradient color map, the color of the different bars in itself carries information. We will explicitly label that information below when constructing a color bar. For now we want to remove the grid, frame and axes lines from our plot, as well as customize its title and x,y axes labels.

# Remove grid lines (dotted lines inside plot)
ax.grid(False)
# Remove plot frame
ax.set_frame_on(False)
# Pandas trick: remove weird dotted line on axis
# (guarded, because some pandas versions do not draw this line at all)
if len(ax.lines) > 0:
    ax.lines[0].set_visible(False)

# Customize title, set position, allow space on top of plot for title
ax.set_title(ax.get_title(), fontsize=26, alpha=a, ha='left')
plt.subplots_adjust(top=0.9)
ax.title.set_position((0,1.08))

# Set x axis label on top of plot, set label text
ax.xaxis.set_label_position('top')
xlab = 'Population (in millions)'
ax.set_xlabel(xlab, fontsize=20, alpha=a, ha='left')
ax.xaxis.set_label_coords(0, 1.04)

# Position x tick labels on top
ax.xaxis.tick_top()
# Remove tick lines in x and y axes
ax.yaxis.set_ticks_position('none')
ax.xaxis.set_ticks_position('none')

# Customize x tick labels
xticks = [5,10,20,50,80]
ax.xaxis.set_ticks(xticks)
ax.set_xticklabels(xticks, fontsize=16, alpha=a)

# Customize y tick labels
yticks = [item.get_text() for item in ax.get_yticklabels()]
ax.set_yticklabels(yticks, fontsize=16, alpha=a)
ax.yaxis.set_tick_params(pad=12)

So far, the lengths of our horizontal bars display the population (in millions) of the EU countries. All bars have the same height (which pandas sets by default to 50% of the total space between bars). An interesting idea is to use the height of the bars to display further data. If we could make the bar height dependent on, say, the countries’ extension, we would be adding a supplementary piece of information to the plot. This is possible in matplotlib by accessing the elements that contain the bars and assigning them a specific height in a for loop. Each bar is an instance of the class Rectangle, and all the corresponding class methods can be applied to it. To assign a height according to each country’s extension, we code a simple linear interpolation and create a lambda function to apply it.

# Set bar height dependent on country extension
# Set min and max bar thickness (from 0 to 1)
hmin, hmax = 0.3, 0.9
xmin, xmax = min(df['extension']), max(df['extension'])
# Function that interpolates linearly between hmin and hmax
f = lambda x: hmin + (hmax-hmin)*(x-xmin)/(xmax-xmin)
# Make array of heights
hs = [f(x) for x in df['extension']]

# Iterate over bars
for container in ax.containers:
    # Each bar has a Rectangle element as child
    for i,child in enumerate(container.get_children()):
        # Reset the lower left point of each bar so that bar is centered
        child.set_y(child.get_y()- 0.125 + 0.5-hs[i]/2)
        # Attribute height to each Rectangle according to country's size
        plt.setp(child, height=hs[i])
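
As a quick sanity check of the interpolation, the sketch below evaluates the same linear formula on three extensions taken from our data (Belgium, Germany and France). It is only an illustration: the function name bar_height is an assumption introduced here, and the extreme values are hard-coded instead of being read from the data frame.

```python
# Minimum and maximum bar thickness, as in the article
hmin, hmax = 0.3, 0.9
# Smallest and largest extension in the data set (Belgium and France)
xmin, xmax = 30528, 547030

def bar_height(x):
    """Interpolate linearly between hmin (at xmin) and hmax (at xmax)."""
    return hmin + (hmax - hmin) * (x - xmin) / (xmax - xmin)

h_belgium = bar_height(30528)    # smallest country gets the thinnest bar
h_germany = bar_height(357022)   # an intermediate value
h_france = bar_height(547030)    # largest country gets the thickest bar
print(round(h_belgium, 3), round(h_germany, 3), round(h_france, 3))
```

The smallest country ends up with a bar of thickness hmin, the largest one with thickness hmax, and every other country falls proportionally in between.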

Having added this “dimension” to the plot, we need a way of labelling the information so that the countries’ extension is understandable. A legend would be the ideal solution, but since our plotting directive was set to display the column ['population'], we cannot use the default. We can construct a “fake” legend though, and custom-make its handles to roughly match the height of the bars. We position the legend in the lower right part of our plot.

# Legend
# Create fake labels for legend
l1 = Line2D([], [], linewidth=6, color='k', alpha=a) 
l2 = Line2D([], [], linewidth=12, color='k', alpha=a) 
l3 = Line2D([], [], linewidth=22, color='k', alpha=a)

# Set three legend labels to be min, mean and max of countries extensions 
# (rounded to the nearest 10k km2)
rnd = 10000
ext = df['extension']
labels = [str(int(round(l/rnd)*rnd)) for l in (ext.min(), ext.mean(), ext.max())]

# Position legend in lower right part
# Set ncol=3 for horizontally expanding legend
leg = ax.legend([l1, l2, l3], labels, ncol=3, frameon=False, fontsize=16, 
                bbox_to_anchor=[1.1, 0.11], handlelength=2, 
                handletextpad=1, columnspacing=2, title='Size (in km2)')

# Customize legend title
# Set position to increase space between legend and labels
plt.setp(leg.get_title(), fontsize=20, alpha=a)
leg.get_title().set_position((0, 10))
# Customize transparency for legend labels
plt.setp(leg.get_texts(), alpha=a)
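
The rounding used for the legend labels can be tried out on its own. The sketch below is a plain-Python illustration under the assumption that it runs on Python 3, where / is true division; it uses the extensions list from the beginning of the article and the standard library's statistics.mean in place of the pandas methods.

```python
from statistics import mean

# Country extensions in km2, copied from the data set above
extensions = [547030, 504782, 450295, 357022, 338145, 312685, 301340, 243610,
              238391, 131940, 110879, 93028, 92090, 83871, 78867, 70273, 65300,
              64589, 56594, 49035, 45228, 43094, 41543, 30528]

rnd = 10000  # round to the nearest 10,000 km2
reference_values = (min(extensions), mean(extensions), max(extensions))
labels = [str(int(round(v / rnd) * rnd)) for v in reference_values]
print(labels)
```

Each label is the minimum, mean or maximum extension rounded to the nearest multiple of 10,000, which keeps the legend text short.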

Finally, there is another piece of information in the plot that needs to be labelled, and that is the color map indicating the average life expectancy in the EU countries. Since we used a custom-made color map, the regular call to plt.colorbar() would not work. We need to create a LinearSegmentedColormap instead and “trick” matplotlib to display it as a colorbar. Then we can use the usual customization methods from colorbar to set fonts, transparency, position and size of the diverse elements in the color legend.

# Create a fake colorbar
ctb = LinearSegmentedColormap.from_list('custombar', customcmap, N=2048)
# Trick from http://stackoverflow.com/questions/8342549/
# matplotlib-add-colorbar-to-a-sequence-of-line-plots
sm = plt.cm.ScalarMappable(cmap=ctb, norm=plt.Normalize(vmin=72, vmax=84))
# Fake up the array of the scalar mappable
sm.set_array([])

# Set colorbar, aspect ratio (ax tells matplotlib which axes to attach it to)
cbar = plt.colorbar(sm, ax=ax, alpha=0.05, aspect=16, shrink=0.4)
cbar.solids.set_edgecolor("face")
# Remove colorbar container frame
cbar.outline.set_visible(False)
# Fontsize for colorbar ticklabels
cbar.ax.tick_params(labelsize=16)
# Customize colorbar tick labels
mytks = list(range(72, 86, 2))
cbar.set_ticks(mytks)
# Use a loop variable other than a, so the transparency a is not overwritten
cbar.ax.set_yticklabels([str(t) for t in mytks], alpha=a)

# Colorbar label, customize fontsize and distance to colorbar
cbar.set_label('Life expectancy (in years)', alpha=a, 
               rotation=270, fontsize=20, labelpad=20)
# Remove color bar tick lines, while keeping the tick labels
cbarytks = plt.getp(cbar.ax.axes, 'yticklines')
plt.setp(cbarytks, visible=False)
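
To see what the colorbar actually encodes, the following sketch rebuilds the color map and the normalization on their own and looks up a few values. It assumes 24 colors, as in our data frame, and uses only matplotlib.colors, so no figure is needed.

```python
from matplotlib.colors import LinearSegmentedColormap, Normalize

# Same list of RGB tuples as in the article (assumption: 24 rows)
colors = [(x / 24.0, x / 48.0, 0.05) for x in range(24)]
ctb = LinearSegmentedColormap.from_list('custombar', colors, N=2048)

# Map life expectancies between 72 and 84 years onto the interval [0, 1]
norm = Normalize(vmin=72, vmax=84)

mid = float(norm(78))        # position of 78 years along the colorbar
rgba_low = ctb(norm(72))     # RGBA color at the lower end of the bar
rgba_high = ctb(norm(84))    # RGBA color at the upper end of the bar
print(mid, rgba_low, rgba_high)
```

The normalization places 78 years halfway along the bar, and the two ends of the bar take the first and last colors of our list. Note that the bars themselves are colored by their rank in the sorted data frame, so the colorbar is best read as an approximate guide to life expectancy, not as an exact mapping.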

The final and most rewarding step consists of saving the figure in our preferred format.

# Save figure in png with tight bounding box
plt.savefig('EU.png', bbox_inches='tight', dpi=300)

The final result looks this beautiful:

Table-top data experiment take-away message

When producing a plot based on multidimensional data, it is a good idea to resort to shapes and colors that visually guide us through the variables on display. Matplotlib offers a high level of customization for all details of a plot, although finding exactly which knob to tweak can be bewildering at times. Beautiful plots can be created by experimenting with various settings, among which hues, transparencies and simple layouts are the focal points. The results are publication-ready figures made with open-source software that can be easily replicated by means of structured Python code.

25 comments

May 29, 2014 - 3:14 pm Sunil

Hi all,
I’am new to python.
When i copy paste and ran python file, it threw an error stating “plt is not defined”.
Can anyone help me??

- June 3, 2014 - 3:09 pm Kaspar Snashall
  
  you need to import the matplotlib module
  by
  import matplotlib.pylab as plt
  
  if its say no module named matplotlib try downloading spyder this is a guy with a lot of installed modules then it should work
  
- June 14, 2014 - 2:15 pm datasciencelab
  
  Hi Sunil,
  Kaspar is right, one needs to import matplotlib.
  If running this code directly in the ipython notebook, the plotting modules are already imported. You can just start the ipython notebook as:
  $> ipython notebook --pylab=inline
  
June 3, 2014 - 3:10 pm Kaspar Snashall

gui* not guy

June 10, 2014 - 7:04 pm Anonymous

the anaconda package is probably the easiest way to start learning python for data vis and uses spyder as default IDE.
http://continuum.io/downloads

August 21, 2014 - 7:25 pm gammacephei

Thank you for these suggestions—I used them for a similar barh() chart and was able to get rid of some junk. Thanks especially for the tip for getting rid of the ugly dashed line that pandas puts in the middle of bar charts that pass below zero.

January 20, 2015 - 6:21 am JM

Not beautiful! Not… one… bit!

July 12, 2015 - 7:22 pm Michael Reinhard (@MichaelReinhar9)

I really appreciate this code and tutorial. I have finally started to understand some of the stuff going on in Python plotting.

So I have been working through the code and have gotten it to work all the way down to the part where you create the legend for the width of the bars. I have gotten the NameError: name ‘mean’ is not defined. Here is the line where this occurs:

labels = [str(int(round(l/rnd)*rnd)) for l in min(df['extension']),
mean(df['extension']), max(df['extension'])]

Here is the error message:

NameError: name ‘mean’ is not defined

I have tried pd.mean() and mpl.mean()–oh, wait, I just solved the problem!

You have to import numpy and do np.mean()!

It might be worth changing the code up top!

July 12, 2015 - 9:05 pm Michael Reinhard (@MichaelReinhar9)

Ok, this time I have a problem I cannot solve. I have pasted in the code to be certain that I am not making a mistake.

The problem is that the graph and the legend are not printing on top of one another. Everything is fine till the last block of code. Does anyone know how to get the legend to print on top of the graph as it should?

July 12, 2015 - 9:15 pm Michael Reinhard (@MichaelReinhar9)

Also, when I try the last bit of code to print to a file I get an assertion error:

721 assert(len(bboxes))
722
723 if len(bboxes) == 1:

I suppose that there might be a version problem or something but I am stumped.

May 31, 2016 - 3:26 am Anonymous

when use this tutorial, an error happened:
ax.lines[0].set_visible(False)
IndexError: list index out of range
can anyone help me?

- September 6, 2018 - 10:15 am Levan Alibegashvili (alibega)
  
  Same here.. any suggestions?
  Thanks
  
June 23, 2016 - 6:33 am Doru

Traceback (most recent call last):
File "D:\NewPD\_WORK\_LINIA_PROIECT\test01.py", line 129, in
sm = plt.cm.ScalarMappable(cmap=ctb, norm=plt.normalize(vmin=72, vmax=84))
AttributeError: 'module' object has no attribute 'normalize'

?

- July 20, 2016 - 5:35 pm filmferipe Fernandes
  
  just replace by capital ‘N’ – Normalize
  
January 25, 2017 - 1:00 am Hannah Manning

Is there a similarly elegant way to make boxplots with pandas and matplotlib? I’ve been hunting around but most options seem to leave me with little control over the aesthetics. Disclaimer: I’m fairly new to all this.

March 29, 2017 - 1:53 pm Ajan

Great post! Thank you for sharing. I am wondering if the colour of the horizontal bar can change horizontally according to the x values (e.g., to express changes over time)?

March 31, 2017 - 1:19 pm Brian

This was a nice post that really helped me get started to coloring the plots. But I really wanted to use a standard color map and it took me a bit to figure out how. What I learned is that from this site http://thomas-cokelaer.info/blog/2014/09/about-matplotlib-colormap-and-how-to-get-rgb-values-of-the-map/ you can get the actual color map array. And that the ones I check have 256 value RGB and Transparency values. So what I did was I scaled my data to 1-256. Then I used the scaled integer to choose the RGB values from the colormap I wanted. Then I colored the bars accordingly like you did. The I used the standard color map to make the legend using your steps. The only caveat was I couldn’t get the alpha values and transparency of the bars and the colorbar to match and I had the guess at the alpha value a little. Thanks for the help!

April 13, 2017 - 9:29 pm andrew

I’ve copied the code and it gives me a syntax error on the section where we create the legend.

rnd = 10000
labels = [str(int(round(l/rnd)*rnd)) for l in min(df['extension']),
mean(df['extension']), max(df['extension'])]

# Position legend in lower right part
# Set ncol=3 for horizontally expanding legend
leg = ax.legend([l1, l2, l3], labels, ncol=3, frameon=False, fontsize=16,
bbox_to_anchor=[1.1, 0.11], handlelength=2,
handletextpad=1, columnspacing=2, title='Size (in km2)')

python europetest.py
File "europetest.py", line 143
labels = [str(int(round(l / rnd) * rnd)) for l in min(df['extension']),
^
SyntaxError: invalid syntax

July 13, 2018 - 3:15 pm Preetam Singareddy

hi,
I have a scatter plot in pandas with a number of values. I have a line at the mean values of the Y axis data. How do I change the color of the dots, so that all dots above the mean are green and all dots below the mean are red?
Thanks.
