Beautiful Plots With Pandas and Matplotlib
[Click here to see the final plot described in this article.]
Data visualization plays a crucial role in the communication of results from data analyses, and it should always help transmit insights in an honest and clear way. Recently, the highly recommendable blog Flowing Data posted a review of data visualization highlights during 2013, and at The Data Science Lab we felt like doing a bit of pretty plotting as well.
For Python lovers, matplotlib is the library of choice when it comes to plotting. Quite conveniently, the data analysis library pandas comes equipped with useful wrappers around several matplotlib plotting routines, allowing for quick and handy plotting of data frames. Nice examples of plotting with pandas can be seen for instance in this ipython notebook. Still, for customized plots or not so typical visualizations, the pandas wrappers need a bit of tweaking and playing with matplotlib’s inside machinery. If one is willing to devote a bit of time to googling and experimenting, very beautiful plots can emerge.
Visualizing demographic data
For this pre-Christmas data visualization table-top experiment we are going to use demographic data from countries in the European Union obtained from Wolfram|Alpha. Our data set contains information on population, extension and life expectancy in 24 European countries. We create a pandas data frame from three series that we simply construct from lists, setting the countries as index for each series, and consequently for the data frame.
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
from matplotlib.lines import Line2D

countries = ['France','Spain','Sweden','Germany','Finland','Poland','Italy',
             'United Kingdom','Romania','Greece','Bulgaria','Hungary',
             'Portugal','Austria','Czech Republic','Ireland','Lithuania','Latvia',
             'Croatia','Slovakia','Estonia','Denmark','Netherlands','Belgium']
extensions = [547030,504782,450295,357022,338145,312685,301340,243610,238391,
              131940,110879,93028,92090,83871,78867,70273,65300,64589,56594,
              49035,45228,43094,41543,30528]
populations = [63.8,47,9.55,81.8,5.42,38.3,61.1,63.2,21.3,11.4,7.35,
               9.93,10.7,8.44,10.6,4.63,3.28,2.23,4.38,5.49,1.34,5.61,
               16.8,10.8]
life_expectancies = [81.8,82.1,81.8,80.7,80.5,76.4,82.4,80.5,73.8,80.8,73.5,
                     74.6,79.9,81.1,77.7,80.7,72.1,72.2,77,75.4,74.4,79.4,81,80.5]
# Build one Series per variable, all indexed by country
data = {'extension' : pd.Series(extensions, index=countries),
        'population' : pd.Series(populations, index=countries),
        'life expectancy' : pd.Series(life_expectancies, index=countries)}
df = pd.DataFrame(data)
# Sort by life expectancy (older pandas releases used df.sort() here)
df = df.sort_values('life expectancy')
Now, thanks to the pandas plotting machinery, it is extremely straightforward to show the contents of this data frame by calling the plot method of each of its columns. The code below generates a figure with three subplots displayed vertically, each of which shows a bar plot for a particular column of the data frame. The plots are automatically labelled with the column names of the data frame, and the whole procedure takes only a few lines of code.
# One subplot per column, stacked vertically
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(12, 10))
for i, c in enumerate(df.columns):
    df[c].plot(kind='bar', ax=axes[i], title=c)
plt.savefig('EU1.png', bbox_inches='tight')
The output figure looks like this:
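As an aside, pandas can also create the stacked subplots itself when the plot method is called on the whole data frame with subplots=True. The following is a minimal sketch on three rows of the data above; the name toy and the headless Agg backend are assumptions made for illustration only.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend, so no display is needed
import pandas as pd

# Three countries taken from the data above, same column names as in the article
toy = pd.DataFrame(
    {'extension': [547030, 504782, 450295],
     'population': [63.8, 47, 9.55],
     'life expectancy': [81.8, 82.1, 81.8]},
    index=['France', 'Spain', 'Sweden'])

# subplots=True asks pandas to draw each column in its own axes
axes = toy.plot(kind='bar', subplots=True, figsize=(12, 10), legend=False)
print(len(axes))
```

The explicit loop over columns gives more control over each axis, which is why it remains handy once customization starts.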
Customization with matplotlib directives
While this is an acceptable plot for the first steps of data exploration, the figure is not really publication-ready. It also looks very much “academic” and lacks that subtle flair that infographics in mainstream media have. Over the next paragraphs we will turn this plot into a much more beautiful object by playing around with the options that matplotlib supplies.
Let us first start by creating a figure and an axis object that will contain our subfigure:
# Create a figure of given size
fig = plt.figure(figsize=(16,12))
# Add a subplot
ax = fig.add_subplot(111)
# Set title
ttl = 'Population, size and age expectancy in the European Union'
Colors are very important for data visualizations. By default, the matplotlib color palette offers solid hues, which can be softened by applying transparencies. Similarly, the default colorbars can be customized to match our taste (see below how one can define a custom-made color map with a gradient that softly changes from orange to gray-blue hues).
# Set color transparency (0: transparent; 1: solid)
a = 0.7
# Create a colormap: one RGB tuple per row of the data frame
customcmap = [(x/24.0, x/48.0, 0.05) for x in range(len(df))]
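The list of colors runs from almost black to orange, yet on screen the dark end reads as gray-blue because of the transparency. A rough way to see this is to blend the first and last colors with a white background. The sketch below assumes plain alpha compositing over white and 24 rows; blend_over_white is a hypothetical helper, not a matplotlib function.

```python
n = 24   # assumption: 24 rows, as in the data frame above
a = 0.7  # same transparency as in the article

# Same recipe as the custom color list: red and green grow with the row index
customcmap = [(x / 24.0, x / 48.0, 0.05) for x in range(n)]

def blend_over_white(rgb, alpha):
    """Approximate the on-screen color of a translucent patch drawn on white."""
    return tuple(round(alpha * c + (1 - alpha) * 1.0, 3) for c in rgb)

first = blend_over_white(customcmap[0], a)   # row with the lowest life expectancy
last = blend_over_white(customcmap[-1], a)   # row with the highest life expectancy
print(first, last)
```

The first tuple has a blue channel slightly above red and green (a bluish gray), while the last one is dominated by red with about half as much green (orange).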
The main plotting instruction in our figure uses the pandas plot wrapper. In the initialization options, we specify the type of plot (horizontal bar), the transparency, the color of the bars following the above-defined custom color map, the x-axis limits and the figure title. We also set the color of the bar borders to white for a cleaner look.
# Plot the 'population' column as horizontal bar plot
df['population'].plot(kind='barh', ax=ax, alpha=a, legend=False, color=customcmap,
                      edgecolor='w', xlim=(0,max(df['population'])), title=ttl)
After this simple pandas plot directive, the figure already looks very promising. Note that, because we sorted the data frame by life expectancy and applied a gradient color map, the color of the different bars in itself carries information. We will explicitly label that information below when constructing a color bar. For now we want to remove the grid, frame and axes lines from our plot, as well as customize its title and x,y axes labels.
# Remove grid lines (dotted lines inside plot)
ax.grid(False)
# Remove plot frame
ax.set_frame_on(False)
# Pandas trick: hide the stray dotted line that some pandas versions draw on the axis
# (looping avoids an IndexError when no such line exists)
for line in ax.lines:
    line.set_visible(False)
# Customize title, set position, allow space on top of plot for title
ax.set_title(ax.get_title(), fontsize=26, alpha=a, ha='left')
plt.subplots_adjust(top=0.9)
ax.title.set_position((0,1.08))
# Set x axis label on top of plot, set label text
ax.xaxis.set_label_position('top')
xlab = 'Population (in millions)'
ax.set_xlabel(xlab, fontsize=20, alpha=a, ha='left')
ax.xaxis.set_label_coords(0, 1.04)
# Position x tick labels on top
ax.xaxis.tick_top()
# Remove tick lines in x and y axes
ax.yaxis.set_ticks_position('none')
ax.xaxis.set_ticks_position('none')
# Customize x tick labels
xticks = [5,10,20,50,80]
ax.xaxis.set_ticks(xticks)
ax.set_xticklabels(xticks, fontsize=16, alpha=a)
# Customize y tick labels
yticks = [item.get_text() for item in ax.get_yticklabels()]
ax.set_yticklabels(yticks, fontsize=16, alpha=a)
ax.yaxis.set_tick_params(pad=12)
So far, the lengths of our horizontal bars display the population (in millions) of the EU countries. All bars have the same height (which pandas sets by default to 50% of the total space between bars). An interesting idea is to use the height of the bars to display further data. If we could make the bar height dependent on, say, the countries’ extension, we would be adding a supplementary piece of information to the plot. This is possible in matplotlib by accessing the elements that contain the bars and assigning them a specific height in a for loop. Each bar is an element of the class Rectangle, and all the corresponding class methods can be applied to it. To assign a height according to each country’s extension, we code a simple linear interpolation and create a lambda function to apply it.
# Set bar height dependent on country extension
# Set min and max bar thickness (from 0 to 1)
hmin, hmax = 0.3, 0.9
xmin, xmax = min(df['extension']), max(df['extension'])
# Function that interpolates linearly between hmin and hmax
f = lambda x: hmin + (hmax-hmin)*(x-xmin)/(xmax-xmin)
# Make array of heights
hs = [f(x) for x in df['extension']]
# Iterate over bars
for container in ax.containers:
    # Each bar has a Rectangle element as child
    for i, child in enumerate(container.get_children()):
        # Reset the lower left point of each bar so that bar is centered
        child.set_y(child.get_y() - 0.125 + 0.5 - hs[i]/2)
        # Assign a height to each Rectangle according to the country's size
        plt.setp(child, height=hs[i])
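The interpolation can be checked in isolation, without any plotting. The sketch below rewrites the lambda as a named function (bar_height is a hypothetical name) and applies it to three areas taken from the data above; the smallest country should land on hmin and the largest on hmax.

```python
def bar_height(x, xmin, xmax, hmin=0.3, hmax=0.9):
    """Map x in [xmin, xmax] linearly onto [hmin, hmax]."""
    return hmin + (hmax - hmin) * (x - xmin) / (xmax - xmin)

# Areas in km2 from the data above; Belgium and France are the
# smallest and largest countries of the full data set
areas = {'Belgium': 30528, 'Germany': 357022, 'France': 547030}
xmin, xmax = min(areas.values()), max(areas.values())

heights = {name: bar_height(area, xmin, xmax) for name, area in areas.items()}
for name, h in heights.items():
    print(name, round(h, 3))
```

Any country in between gets a height proportional to where its area sits between the two extremes, so Germany ends up a bit above the midpoint of the allowed range.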
Having added this “dimension” to the plot, we need a way of labelling the information so that the countries’ extension is understandable. A legend would be the ideal solution, but since our plotting directive was set to display the column ['population'], we cannot use the default. We can construct a “fake” legend though, and customize its handles to roughly match the height of the bars. We position the legend in the lower right part of our plot.
# Legend
# Create fake handles for the legend
l1 = Line2D([], [], linewidth=6, color='k', alpha=a)
l2 = Line2D([], [], linewidth=12, color='k', alpha=a)
l3 = Line2D([], [], linewidth=22, color='k', alpha=a)
# Set three legend labels to be min, mean and max of countries extensions
# (rounded to the nearest 10k km2)
rnd = 10000
ext = df['extension']
labels = [str(int(round(l/rnd)*rnd)) for l in (ext.min(), ext.mean(), ext.max())]
# Position legend in lower right part
# Set ncol=3 for horizontally expanding legend
leg = ax.legend([l1, l2, l3], labels, ncol=3, frameon=False, fontsize=16,
                bbox_to_anchor=[1.1, 0.11], handlelength=2,
                handletextpad=1, columnspacing=2, title='Size (in km2)')
# Customize legend title
# Set position to increase space between legend and labels
plt.setp(leg.get_title(), fontsize=20, alpha=a)
leg.get_title().set_position((0, 10))
# Customize transparency for legend labels
plt.setp(leg.get_texts(), alpha=a)
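The label rounding is easy to try on its own. The short sketch below applies the same expression to the smallest and largest areas in the data; note that round() goes to the nearest multiple of 10,000 rather than always rounding up. The variable names are illustrative.

```python
rnd = 10000
# Smallest and largest areas (km2) in the data above
values = [30528, 547030]

# Divide, round to the nearest integer, scale back up
labels = [str(int(round(v / rnd) * rnd)) for v in values]
print(labels)  # ['30000', '550000']
```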
Finally, there is another piece of information in the plot that needs to be labelled, and that is the color map indicating the average life expectancy in the EU countries. Since we used a custom-made color map, the regular call to plt.colorbar() would not work. We need to create a LinearSegmentedColormap instead and “trick” matplotlib into displaying it as a colorbar. Then we can use the usual customization methods from colorbar to set fonts, transparency, position and size of the diverse elements in the color legend.
# Create a fake colorbar
ctb = LinearSegmentedColormap.from_list('custombar', customcmap, N=2048)
# Trick from http://stackoverflow.com/questions/8342549/
# matplotlib-add-colorbar-to-a-sequence-of-line-plots
sm = plt.cm.ScalarMappable(cmap=ctb, norm=plt.Normalize(vmin=72, vmax=84))
# Fake up the array of the scalar mappable
sm.set_array([])
# Set colorbar, aspect ratio (recent matplotlib needs the target axes explicitly)
cbar = plt.colorbar(sm, ax=ax, alpha=0.05, aspect=16, shrink=0.4)
cbar.solids.set_edgecolor("face")
# Remove colorbar container frame
cbar.outline.set_visible(False)
# Fontsize for colorbar ticklabels
cbar.ax.tick_params(labelsize=16)
# Customize colorbar tick labels
# (the loop variable is named t so that it cannot overwrite the transparency a)
mytks = list(range(72,86,2))
cbar.set_ticks(mytks)
cbar.ax.set_yticklabels([str(t) for t in mytks], alpha=a)
# Colorbar label, customize fontsize and distance to colorbar
cbar.set_label('Age expectancy (in years)', alpha=a,
               rotation=270, fontsize=20, labelpad=20)
# Remove color bar tick lines, while keeping the tick labels
cbarytks = plt.getp(cbar.ax, 'yticklines')
plt.setp(cbarytks, visible=False)
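Stripped of all styling, the colorbar trick boils down to three objects: a colormap built from the list of colors, a normalization that maps life expectancy to the range 0 to 1, and a ScalarMappable that ties the two together. The following is a minimal standalone sketch under the assumption of a headless Agg backend and 24 colors; it is not the full figure code.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend, so no display is needed
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap, Normalize

# Same color recipe as in the article, for 24 rows
colors = [(x / 24.0, x / 48.0, 0.05) for x in range(24)]
cmap = LinearSegmentedColormap.from_list('custombar', colors, N=256)
norm = Normalize(vmin=72, vmax=84)  # life expectancy range shown on the bar

fig, ax = plt.subplots()
sm = plt.cm.ScalarMappable(norm=norm, cmap=cmap)
sm.set_array([])                    # no data attached, only the color scale
cbar = fig.colorbar(sm, ax=ax)

# Colors at both ends of the scale, as RGBA tuples
low = cmap(norm(72))
high = cmap(norm(84))
print(low, high)
```

Because the mappable carries no data, the colorbar only documents the color scale; it is up to the plotting code to make sure the bars were colored consistently with it.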
The final and most rewarding step consists of saving the figure in our preferred format.
# Save figure in png with tight bounding box
plt.savefig('EU.png', bbox_inches='tight', dpi=300)
The final result looks this beautiful:
Table-top data experiment take-away message
When producing a plot based on multidimensional data, it is a good idea to resort to shapes and colors that visually guide us through the variables on display. Matplotlib offers a high level of customization for all details of a plot, although finding exactly which knob to tweak can at times be bewildering. Beautiful plots can be created by experimenting with various settings, among which hues, transparencies and simple layouts are the focal points. The results are publication-ready figures, produced with open-source software, that can be easily replicated by means of structured Python code.
Put Some Pandas in Your Python
Every scientist needs tools to perform their experiments, and data scientists are no exception. Much advice has been written to answer the question of what the best stack of tools for machine learning, data analysis and big data is. At least for the past few years, the champions in the programming language league seem to be R, Python and SQL. SQL needs no introduction, being the standard tool for managing records held in relational databases. R is undeniably the language of choice for most data analysts and statisticians; the size of its community of users alone, together with the thousands of contributed libraries and add-on packages, turns it into a reliable workhorse for a broad range of tasks. But one thing R lacks is a consistent syntax. To use the words of somebody else, the R language is pathologically flexible, and that might be very confusing to new users.
Python as a language for data science
Python, on the contrary, might not be king when it comes to statistics-specific problems; it is rather a general-purpose programming language. Luis Pedro Coelho summarizes it well:
Python may often be the second best choice: for linear algebra, Matlab may have nicer syntax; for statistics, R is probably nicer; for heavy regular expression usage, Perl (ugh) might still be nicer; if you want speed, Fortran or C(++) may be a better choice. To design a webpage; perhaps you want node.js. Python is not perfect for any of these, but is acceptable for all of them.
Still and all, Python is growing to include an increasingly mature set of packages for data manipulation and analysis. One that has gotten a respectable amount of attention lately is pandas, a library that offers data structures and operations for manipulating numerical tables and time series. Its creator, Wes McKinney, regularly posts materials and tutorials on his blog, and has written a very handy book on data analysis with Python. In combination with the ipython notebook, pandas provides an easy environment to develop and present data analysis routines.
Data analysis with pandas: a beginner's tutorial
Let us have a brief look into the basics of pandas with a beginner's tutorial. Installation of both ipython and pandas is trivial via pip or anaconda, and preferable to using the standard Linux apt-get install utility, since it usually provides more recent versions. At the time of writing, the latest releases of pandas and ipython are 0.12.0 and 1.1.0 respectively. Once installed, the notebook starts in the browser when invoking the following command from the terminal:
$ ipython notebook --pylab inline
Pandas has very neat routines to load datasets in various formats, which can be browsed by typing pd.read*? in ipython. For this table-top experiment we will read a CSV file containing attributes of photo cameras, found on the Aviz project page. The data specifies model, weight, focal length and price, among others, and can be downloaded from the Aviz wiki.
We load the file, indicating that the fields are separated by semicolons, and rename the columns with descriptive labels for the variables. This creates a DataFrame, a fundamental structure that stores entries in tabular form, with an index (dataframe.index) for the rows and multiple columns for the different variables (dataframe.columns). Each column can be retrieved by dictionary-like notation (dataframe['column_name']) or by attribute (dataframe.column_name). The method head(n) applied to a data frame displays the first n rows of data (default = 5). The method apply(f) applies a function to each column or row of a data frame, or to each element of a single column. For our camera dataset, we also want to access the maker, which corresponds to the first word of the model name. Thus, we create a new column 'Maker' by operating on the existing column 'Model'.
import pandas as pd

# Fields in the file are separated by semicolons
df = pd.read_csv('Camera.csv', sep=';')
# Rename the columns with short descriptive labels
columns = ['Model', 'Date', 'MaxRes', 'LowRes', 'EffPix', 'ZoomW', 'ZoomT',
           'NormalFR', 'MacroFR', 'Storage', 'Weight', 'Dimensions', 'Price']
df.columns = columns
# The maker is the first word of the model name
df['Maker'] = df['Model'].apply(lambda s: s.split()[0])
df[['Maker','Model','Date','MaxRes','LowRes','Weight','Dimensions','Price']].head()
| | Maker | Model | Date | MaxRes | LowRes | Weight | Dimensions | Price |
|---|---|---|---|---|---|---|---|---|
| 0 | Agfa | Agfa ePhoto 1280 | 1997 | 1024 | 640 | 420 | 95 | 179 |
| 1 | Agfa | Agfa ePhoto 1680 | 1998 | 1280 | 640 | 420 | 158 | 179 |
| 2 | Agfa | Agfa ePhoto CL18 | 2000 | 640 | 0 | 0 | 0 | 179 |
| 3 | Agfa | Agfa ePhoto CL30 | 1999 | 1152 | 640 | 0 | 0 | 269 |
| 4 | Agfa | Agfa ePhoto CL30 Clik! | 1999 | 1152 | 640 | 300 | 128 | 1299 |
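The loading steps can be tried without downloading the file by reading from an in-memory string. In the sketch below the header names and the "Acme" row are made up for illustration (only the Agfa row is taken from the table above), and toy is a hypothetical variable name.

```python
import io
import pandas as pd

# Semicolon-separated text standing in for Camera.csv
# (assumption: illustrative header; the second row is invented)
raw = io.StringIO(
    "Model;Release date;Price\n"
    "Agfa ePhoto 1280;1997;179\n"
    "Acme Snap 1;2001;99\n"
)

toy = pd.read_csv(raw, sep=';')
# Replace the original header with short labels
toy.columns = ['Model', 'Date', 'Price']
# The maker is the first word of the model name
toy['Maker'] = toy['Model'].apply(lambda s: s.split()[0])
print(toy[['Maker', 'Model', 'Date', 'Price']].head())
```

Calling apply on a single column runs the function once per value, which is what makes the one-line extraction of the maker possible.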
A number of handy operations can be performed on a data frame with virtually no effort:
- Sorting data: display the 5 most recent models
df.sort_values(['Date'], ascending=False).head()
- Filtering rows by column value: show only models made by Nikon
df[df['Maker'] == 'Nikon']
- Filtering rows by range of values: return cameras with prices above 350 and up to 500
df[(df['Price'] > 350) & (df['Price'] <= 500)]
- Get statistical descriptions of the data set: find maxima, minima, averages, standard deviations, percentiles
df[['MaxRes','LowRes','Storage','Weight','Dimensions','Price']].describe()
| | MaxRes | LowRes | Storage | Weight | Dimensions | Price |
|---|---|---|---|---|---|---|
| count | 1038.000000 | 1038.000000 | 1036.000000 | 1036.000000 | 1036.000000 | 1038.000000 |
| mean | 2474.672447 | 1773.936416 | 17.447876 | 319.265444 | 105.363417 | 457.384393 |
| std | 759.513608 | 830.897955 | 27.440655 | 260.410137 | 24.262761 | 760.452918 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 14.000000 |
| 25% | 2048.000000 | 1120.000000 | 8.000000 | 180.000000 | 92.000000 | 149.000000 |
| 50% | 2560.000000 | 2048.000000 | 16.000000 | 226.000000 | 101.000000 | 199.000000 |
| 75% | 3072.000000 | 2560.000000 | 20.000000 | 350.000000 | 115.000000 | 399.000000 |
| max | 5616.000000 | 4992.000000 | 450.000000 | 1860.000000 | 240.000000 | 7999.000000 |
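The sorting and filtering operations listed above all return new data frames, so they are easy to explore on a tiny example. The values below are invented for illustration and do not come from the camera data set.

```python
import pandas as pd

# Invented toy data with the same column names as the camera data frame
toy = pd.DataFrame({
    'Maker': ['Agfa', 'Nikon', 'Nikon', 'Acme'],
    'Date': [1997, 2003, 2005, 2001],
    'Price': [179, 350, 500, 620],
})

# Two most recent models
newest = toy.sort_values('Date', ascending=False).head(2)
# Boolean mask: keep rows where the condition is True
nikon = toy[toy['Maker'] == 'Nikon']
# Combine masks with & and parentheses; 350 is excluded, 500 is included
in_range = toy[(toy['Price'] > 350) & (toy['Price'] <= 500)]

print(newest, nikon, in_range, sep='\n\n')
```

The parentheses around each comparison matter, because & binds more tightly than the comparison operators in Python.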
Plotting data frames with pandas
Pandas comes with handy wrappers around standard matplotlib routines that make it very easy to plot data frames. In the example below we create a histogram of all camera prices under 500 in our data set. Most camera models are priced in the bracket between 100 and 150 (units are not provided in the metadata of this set, but we might safely assume that we are talking about US dollars here).
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams.update({'font.size': 16})
# Histogram of all prices under 500
df[(df['Price'] < 500)][['Price']].hist(figsize=(8,5), bins=10, alpha=0.5)
plt.title('Histogram of camera prices\n')
plt.savefig('PricesHist.png', bbox_inches='tight')
Grouping data with pandas
A powerful way of manipulating data in a frame is via processes that involve splitting it, applying functions to groups and combining the results. This is the “group by” approach that most SQL users will immediately recognize. Here is a clear exposition of how groupby works in pandas. In our table-top experiment, we group the data according to release year of the models, and study the evolution of the remaining variables with time. To the grouped structure gDate we apply the aggregating function mean(), which averages the non-grouped variables year by year. Quite reassuringly, we find out that photo cameras have become lighter over the past decade, while pixels and storage of newer models have steadily increased.
# Average the numeric columns per release year (models released after 1998)
gDate = df[df['Date'] > 1998].groupby('Date').mean(numeric_only=True)
dates = [str(s) for s in gDate.index]
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(20,5))
cols = ['b', 'r', 'g']
variables = ['EffPix', 'Weight', 'Storage']
titles = ['effective pixels', 'weight', 'storage']
for i, var in enumerate(variables):
    gDate[[var]].plot(ax=axes[i], alpha=0.5, legend=False, lw=4, color=cols[i])
    # One tick per year, so that the labels line up with the data
    axes[i].set_xticks(list(gDate.index))
    axes[i].set_xticklabels(dates, rotation=40)
    axes[i].set_title('Evolution of %s\n' % titles[i])
plt.savefig('CameraEvolution.png', bbox_inches='tight')
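The split-apply-combine idea is easiest to see on a handful of rows. The sketch below uses invented weights for two release years; numeric_only=True tells pandas to skip text columns such as the model name, which recent pandas versions would otherwise refuse to average.

```python
import pandas as pd

# Invented toy data: two models per release year
toy = pd.DataFrame({
    'Date': [1999, 1999, 2000, 2000],
    'Model': ['A1', 'A2', 'B1', 'B2'],
    'Weight': [400.0, 300.0, 250.0, 150.0],
})

# Split by year, average each group, combine into one row per year
by_year = toy.groupby('Date').mean(numeric_only=True)
print(by_year)
```

The grouping column becomes the index of the result, which is why gDate.index holds the years in the plotting code.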
Grouping by maker instead, we can see which ones make the smallest and most economical models, on average.
# Median of the numeric columns per maker (a robust kind of average)
gMak = df.groupby('Maker').median(numeric_only=True)
gMak.index.name = ''
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(18,8))
c = ['y','c']
variables = ['Dimensions', 'Price']
for i, var in enumerate(variables):
    gMak[[var]].plot(kind='barh', ax=axes[i], alpha=0.5, legend=False, color=c[i])
    axes[i].set_title('Average %s by maker\n' % variables[i])
    axes[i].grid(False)
plt.savefig('MeanDimensionsPrices.png', bbox_inches='tight')
Pandas provides many more functionalities, and this is just a first look at them here at The Data Science Lab. We will continue the exploration of this and other datasets in due time. In the meantime, a quick google search for pandas should be enough to keep us entertained.