Here’s an excerpt:

The Louvre Museum has 8.5 million visitors per year. This blog was viewed about 74,000 times in 2014. If it were an exhibit at the Louvre Museum, it would take about 3 days for that many people to see it.

Although the gap statistic, based on a paper by Tibshirani et al., was shown to find optimal values for the number of clusters in a variety of cases when the clusters were globular and mildly disjointed, its performance might be hampered by the need to perform Monte Carlo simulations to generate the reference datasets. A reader of this blog, Jonathan Stray, pointed out a potentially superior method for selecting the K in k-means clustering, so let us implement it and compare.

The approach suggested by our reader is based on a publication by Pham, Dimov and Nguyen from 2004. The article is very much worth reading, as it includes an explanation of the drawbacks of the standard k-means algorithm as well as a comprehensive survey on different methods that have been proposed for selecting an optimal number of clusters.

In section 3 of the paper, the authors justify the introduction of a function $f(K)$ to evaluate the quality of the resulting clustering and help decide on the optimal value of $K$ for each data set. Quoting from the paper:

A data set with $N$ objects could be grouped into any number of clusters between 1 and $N$, which would correspond to the lowest and the highest levels of detail respectively. By specifying different $K$ values, it is possible to assess the results of grouping objects into various numbers of clusters. From this evaluation, more than one $K$ value could be recommended to users, but the final selection is made by them.

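The goal of a clustering algorithm is to identify regions in which the data points are concentrated. It is also important to analyze the internal distribution of each cluster as well as its relation to other clusters in the data set. The distortion of a cluster is a measure of the distance between points in a cluster and its centroid:

$$I_j = \sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \mu_j\|^2.$$

The global impact of all clusters’ distortions is given by the quantity

$$S_K = \sum_{j=1}^{K} I_j.$$

The authors Pham et al. proceed to discuss further constraints that the sought-after function $f(K)$ should satisfy for it to be informative for the problem of selecting $K$. They finally arrive at the following definition:

$$f(K) = \begin{cases} 1 & \text{if } K = 1, \\ \dfrac{S_K}{\alpha_K S_{K-1}} & \text{if } S_{K-1} \neq 0,\ K > 1, \\ 1 & \text{if } S_{K-1} = 0,\ K > 1, \end{cases} \qquad \alpha_K = \begin{cases} 1 - \dfrac{3}{4 N_d} & \text{if } K = 2 \text{ and } N_d > 1, \\ \alpha_{K-1} + \dfrac{1 - \alpha_{K-1}}{6} & \text{if } K > 2 \text{ and } N_d > 1. \end{cases}$$

$N_d$ is the number of dimensions (attributes) of the data set and $\alpha_K$ is a weight factor. With this definition, $f(K)$ is the ratio of the real distortion to the estimated distortion and decreases when there are areas of concentration in the data distribution. Values of $K$ that yield small $f(K)$ can be regarded as giving well-defined clusters.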
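To make the two distortion quantities concrete, here is a minimal sketch that computes the distortion of each cluster (the sum of squared distances from its points to its centroid) and the total over all clusters for a tiny, already labeled toy data set. The function name `cluster_distortions` and the sample points are illustrative assumptions and not part of the implementation discussed in this post.

```python
import numpy as np

def cluster_distortions(points, labels):
    """Return {cluster label: distortion}, i.e. the sum of squared
    distances from each cluster's points to that cluster's centroid."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    distortions = {}
    for j in np.unique(labels):
        members = points[labels == j]
        centroid = members.mean(axis=0)
        distortions[int(j)] = float(((members - centroid) ** 2).sum())
    return distortions

# Hypothetical toy data: two clusters that are already labeled
points = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0], [4.0, 4.0]]
labels = [0, 0, 1, 1, 1]

I = cluster_distortions(points, labels)   # per-cluster distortions
S_K = sum(I.values())                     # total distortion for K = 2
print(I, S_K)
```

For this data the centroids are (0, 1) and (4, 2), so the two distortions are 2 and 8 and the total is 10.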

Our implementation of the Pham et al. procedure builds on the `KMeans` and `KPlusPlus` Python classes defined in our article on the k-means++ algorithm. We define a new class that inherits from `KPlusPlus` and contains a function to compute $f(K)$:

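    class DetK(KPlusPlus):
        # NOTE: assumes find_centers takes the number of clusters as first argument
        def fK(self, thisk, Skm1=0):
            X = self.X
            Nd = len(X[0])
            # Weight factor alpha_K (float literals avoid integer division)
            a = lambda k, Nd: 1 - 3.0/(4*Nd) if k == 2 else a(k-1, Nd) + (1-a(k-1, Nd))/6.0
            self.find_centers(thisk, method='++')
            mu, clusters = self.mu, self.clusters
            # S_K: sum of squared distances of all points to their centroid
            Sk = sum([np.linalg.norm(mu[i]-c)**2
                      for i in range(thisk) for c in clusters[i]])
            if thisk == 1:
                fs = 1
            elif Skm1 == 0:
                fs = 1
            else:
                fs = Sk/(a(thisk,Nd)*Skm1)
            return fs, Sk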

Note the recursive definition of $\alpha_K$ (variable `a` in the code snapshot above) and the fact that the computation of $f(K)$ for a given $K$ requires knowing the value of $S_{K-1}$, which is passed as an input parameter to the function.
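The recursion for the weight factor can also be unrolled into a loop. The sketch below is a standalone illustration, assuming hypothetical helper names `alpha` and `f_of_k` and made-up distortion values; it uses float literals throughout so that the division behaves the same under Python 2 and Python 3.

```python
def alpha(k, n_dims):
    """Weight factor alpha_K for K >= 2, computed iteratively."""
    if k < 2:
        raise ValueError("alpha_K is only defined for K >= 2")
    value = 1.0 - 3.0 / (4.0 * n_dims)      # alpha_2
    for _ in range(3, k + 1):               # alpha_3 ... alpha_K
        value += (1.0 - value) / 6.0
    return value

def f_of_k(k, s_k, s_km1, n_dims):
    """f(K) from the current and previous total distortions."""
    if k == 1 or s_km1 == 0:
        return 1.0
    return s_k / (alpha(k, n_dims) * s_km1)

# Two-dimensional data: alpha_2 = 1 - 3/8, alpha_3 = alpha_2 + (1 - alpha_2)/6
print(alpha(2, 2), alpha(3, 2))
# Made-up distortions S_1 = 40 and S_2 = 10
print(f_of_k(2, 10.0, 40.0, 2))
```

With two dimensions this gives 0.625 and 0.6875 for the first two weights, and the made-up distortions yield 10 / (0.625 × 40) = 0.4.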

This article aims at showing that the Pham et al. procedure works and is computationally more efficient than the gap statistic. Therefore, we will code up the algorithm for the gap statistic within the same class `DetK`, so that we can run both procedures simultaneously. The full code is below the fold:

    import random
    import numpy as np
    import matplotlib.pyplot as plt

    class DetK(KPlusPlus):
        # NOTE: the calls below assume that init_centers(K) and
        # find_centers(K, method=...) take the number of clusters K

        def fK(self, thisk, Skm1=0):
            X = self.X
            Nd = len(X[0])
            # Weight factor alpha_K (float literals avoid integer division)
            a = lambda k, Nd: 1 - 3.0/(4*Nd) if k == 2 else a(k-1, Nd) + (1-a(k-1, Nd))/6.0
            self.find_centers(thisk, method='++')
            mu, clusters = self.mu, self.clusters
            Sk = sum([np.linalg.norm(mu[i]-c)**2
                      for i in range(thisk) for c in clusters[i]])
            if thisk == 1:
                fs = 1
            elif Skm1 == 0:
                fs = 1
            else:
                fs = Sk/(a(thisk,Nd)*Skm1)
            return fs, Sk

        def _bounding_box(self):
            X = self.X
            xmin, xmax = min(X,key=lambda a:a[0])[0], max(X,key=lambda a:a[0])[0]
            ymin, ymax = min(X,key=lambda a:a[1])[1], max(X,key=lambda a:a[1])[1]
            return (xmin,xmax), (ymin,ymax)

        def gap(self, thisk):
            X = self.X
            (xmin,xmax), (ymin,ymax) = self._bounding_box()
            self.init_centers(thisk)
            self.find_centers(thisk, method='++')
            mu, clusters = self.mu, self.clusters
            Wk = np.log(sum([np.linalg.norm(mu[i]-c)**2/(2*len(c))
                             for i in range(thisk) for c in clusters[i]]))
            # Create B reference datasets
            B = 10
            BWkbs = np.zeros(B)
            for i in range(B):
                Xb = []
                for n in range(len(X)):
                    Xb.append([random.uniform(xmin,xmax),
                               random.uniform(ymin,ymax)])
                Xb = np.array(Xb)
                kb = DetK(thisk, X=Xb)
                kb.init_centers(thisk)
                kb.find_centers(thisk, method='++')
                ms, cs = kb.mu, kb.clusters
                BWkbs[i] = np.log(sum([np.linalg.norm(ms[j]-c)**2/(2*len(c))
                                       for j in range(thisk) for c in cs[j]]))
            Wkb = sum(BWkbs)/B
            # 1.0/B avoids integer division in the simulation-error factor
            sk = np.sqrt(sum((BWkbs-Wkb)**2)/float(B))*np.sqrt(1+1.0/B)
            return Wk, Wkb, sk

        def run(self, maxk, which='both'):
            ks = range(1,maxk)
            fs = np.zeros(len(ks))
            Wks,Wkbs,sks = np.zeros(len(ks)+1),np.zeros(len(ks)+1),np.zeros(len(ks)+1)
            # Special case K=1
            self.init_centers(1)
            if which == 'f':
                fs[0], Sk = self.fK(1)
            elif which == 'gap':
                Wks[0], Wkbs[0], sks[0] = self.gap(1)
            else:
                fs[0], Sk = self.fK(1)
                Wks[0], Wkbs[0], sks[0] = self.gap(1)
            # Rest of Ks
            for k in ks[1:]:
                self.init_centers(k)
                if which == 'f':
                    fs[k-1], Sk = self.fK(k, Skm1=Sk)
                elif which == 'gap':
                    Wks[k-1], Wkbs[k-1], sks[k-1] = self.gap(k)
                else:
                    fs[k-1], Sk = self.fK(k, Skm1=Sk)
                    Wks[k-1], Wkbs[k-1], sks[k-1] = self.gap(k)
            # Store whichever results were requested
            if which != 'gap':
                self.fs = fs
            if which != 'f':
                gaps = Wkbs - Wks
                self.G = np.array([gaps[i] - (gaps[i+1]-sks[i+1])
                                   for i in range(len(ks))])

        def plot_all(self):
            X = self.X
            ks = range(1, len(self.fs)+1)
            fig = plt.figure(figsize=(18,5))
            # Plot 1
            ax1 = fig.add_subplot(131)
            ax1.set_xlim(-1,1)
            ax1.set_ylim(-1,1)
            ax1.plot(list(zip(*X))[0], list(zip(*X))[1], '.', alpha=0.5)
            tit1 = 'N=%s' % (str(len(X)))
            ax1.set_title(tit1, fontsize=16)
            # Plot 2
            ax2 = fig.add_subplot(132)
            ax2.set_ylim(0, 1.25)
            ax2.plot(ks, self.fs, 'ro-', alpha=0.6)
            ax2.set_xlabel('Number of clusters K', fontsize=16)
            ax2.set_ylabel('f(K)', fontsize=16)
            foundfK = np.where(self.fs == min(self.fs))[0][0] + 1
            tit2 = 'f(K) finds %s clusters' % (foundfK)
            ax2.set_title(tit2, fontsize=16)
            # Plot 3
            ax3 = fig.add_subplot(133)
            ax3.bar(ks, self.G, alpha=0.5, color='g', align='center')
            ax3.set_xlabel('Number of clusters K', fontsize=16)
            ax3.set_ylabel('Gap', fontsize=16)
            foundG = np.where(self.G > 0)[0][0] + 1
            tit3 = 'Gap statistic finds %s clusters' % (foundG)
            ax3.set_title(tit3, fontsize=16)
            ax3.xaxis.set_ticks(range(1,len(ks)+1))
            plt.savefig('detK_N%s.png' % (str(len(X))),
                        bbox_inches='tight', dpi=100)

For a first experiment comparing the Pham et al. and the gap statistic approaches, we create a data set comprising 300 points around 2 Gaussian-distributed clusters. We run both methods to select $K$ spanning the values $K = 1, \ldots, 9$. (The function `run` from class `DetK` takes a value `maxk` as input and checks all values $K$ such that $1 \leq K < \mathrm{maxk}$.) Note that every run of the k-means clustering algorithm for different values of $K$ is preceded by the k-means++ initialization algorithm, to prevent landing at suboptimal clustering solutions.

To run a full comparison of both methods, the following simple commands are invoked:

    kpp = DetK(2, N=300)
    kpp.run(10)
    kpp.plot_all()

This produces the following result plots:

According to Pham et al., lower values of $f(K)$, and especially values $f(K) < 0.85$, are an indication of cluster-like features in the data at that particular $K$. In the case of $K=2$, the global minimum of $f(K)$ in the central plot leaves no doubt that this is the right value to choose for this particular data configuration. The gap statistic, depicted in the plot on the right, yields the same result of $K=2$. Remember that the optimal $K$ with the gap statistic is the smallest value for which the gap quantity becomes positive.
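Both selection rules reduce to one-line array operations. The following sketch mirrors the logic used for the plot titles in `plot_all`; the function names `pick_k_from_f` and `pick_k_from_gap` and the numeric values are made up for illustration only.

```python
import numpy as np

def pick_k_from_f(fs):
    """K (1-based) at the global minimum of f(K)."""
    return int(np.argmin(fs)) + 1

def pick_k_from_gap(G):
    """Smallest K (1-based) whose gap quantity is positive, or None."""
    positive = np.where(np.asarray(G) > 0)[0]
    return int(positive[0]) + 1 if positive.size else None

# Made-up values for K = 1..5, not taken from the experiments in this post
fs = [1.0, 0.35, 0.9, 1.05, 0.98]
G = [-0.4, 0.12, 0.3, -0.05, 0.2]

print(pick_k_from_f(fs))     # index of the minimum, shifted to 1-based K
print(pick_k_from_gap(G))    # first positive entry, shifted to 1-based K
```

Returning `None` when no gap value is positive is a design choice of this sketch; it avoids the `IndexError` that indexing an empty `np.where` result would raise.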

Similarly, we can analyze a data set consisting of 100 points around a single cluster. The results are shown in the plots below. We observe how the function $f(K)$ does not show any prominent valley or value for which $f(K) < 0.85$ for any of the surveyed $K$s. According to the Pham et al. paper, this is an indication of no clustering, as is the case. The gap statistic agrees that there is no more than one cluster in this case.

Finally, let us look at two cases, both with 500 data points around 4 clusters. Below are the plots of the results:

For the data distribution on the top, one can see that the 4 clusters are positioned in such a way that they could also be interpreted as 2 clusters made of 2 subclusters each. The $f(K)$ function detects this configuration and suggests 2 possible values of $K$, with a slight preference for $K=2$ over $K=4$. The gap statistic changes sign at $K=2$, albeit barely, and it does so again and more clearly at $K=4$. In both cases, a strict application of the rules prescribed to select the correct $K$ leads to a rather suboptimal, or at least dubious, choice.

In the bottom plot, however, the 4 clusters are somewhat more evenly spread and both algorithms succeed at identifying $K=4$. The $f(K)$ method still shows a relative minimum at $K=2$, indicating a potentially alternative clustering.

If both methods to select the optimal $K$ for k-means clustering yield similar results, one should ask about their relative performance in real-life data science clustering problems. It is straightforward to predict that the gap statistic, with its need for running the k-means algorithm multiple times to create a Monte Carlo reference distribution, will necessarily be the poorer performer. We can easily test this hypothesis with our code by running both approaches on one data set and timing them using the IPython magic function `%time`:

    %time kpp.run(10, which='f')
    CPU times: user 2.72 s, sys: 0.00 s, total: 2.72 s
    Wall time: 2.90 s

    %time kpp.run(10, which='gap')
    CPU times: user 51.30 s, sys: 0.01 s, total: 51.31 s
    Wall time: 51.40 s

In this particular example, the $f(K)$ method is more than one order of magnitude faster than the gap statistic, and this comparison looks worse for the latter the more data we take into consideration and the larger the number $B$ of reference distributions generated.

The estimation of the optimal number of clusters within a set of data points is a very important problem, as most clustering algorithms need that parameter as input in order to group the data. Many methods have been proposed to find the proper $K$, among which the approach proposed by Pham et al. in 2004 seems to offer a very straightforward and performant solution. The estimation of the function $f(K)$ over the desired range of test values for $K$ offers an immediate way of assessing when the cluster-like features appear and allows one to choose among a best value and other alternatives. A comparison in performance with the gap statistic method of Tibshirani et al. concludes that the $f(K)$ method is computationally advantageous.

The standard k-means algorithm does not supply information as to which K is optimal; that has to be found out by alternative methods, so we went a step further and coded up the gap statistic to find the proper k for k-means clustering. In combination with the clustering algorithm, the gap statistic allows us to estimate the best value for k among those in a given range.

An additional problem with the standard k-means procedure still remains though, as shown by the image on the right, where a poor random initialization of the centroids leads to suboptimal clustering:

If the target distribution is disjointedly clustered and only one instantiation of Lloyd’s algorithm is used, the danger exists that the local minimum reached is not the optimal solution.

The initialization problem for the k-means algorithm is an important practical one, and has been discussed extensively. It is desirable to augment the standard k-means clustering procedure with a robust initialization mechanism that guarantees convergence to the optimal solution.

A solution called k-means++ was proposed in 2007 by Arthur and Vassilvitskii. This algorithm comes with a theoretical guarantee to find a solution that is, in expectation, O(log k) competitive with the optimal k-means solution. It is also fairly simple to describe and implement. Starting with a dataset *X* of *N* points $(x_1, \ldots, x_N)$,

- choose an initial center $c_1$ uniformly at random from *X*. Compute the vector containing the square distances between all points in the dataset and $c_1$: $D_i^2 = \|x_i - c_1\|^2$
- choose a second center $c_2$ from *X*, randomly drawn from the probability distribution $D_i^2 / \sum_j D_j^2$
- recompute the distance vector as $D_i^2 = \min\left(\|x_i - c_1\|^2, \|x_i - c_2\|^2\right)$
- choose a successive center $c_l$ and recompute the distance vector as $D_i^2 = \min\left(\|x_i - c_1\|^2, \ldots, \|x_i - c_l\|^2\right)$
- when exactly *k* centers have been chosen, finalize the initialization phase and proceed with the standard k-means algorithm

The interested reader can find a review of the k-means++ algorithm at normaldeviate, a survey of implementations in several languages at rosettacode and a ready-to-use solution in pandas by Jack Maney in github.

Our Python implementation of the k-means++ algorithm builds on the code for standard k-means shown in the previous post. The `KMeans` class defined below contains all necessary functions and methods to generate toy data and run Lloyd’s clustering algorithm on it:

    import random
    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib import cm

    class KMeans():
        def __init__(self, K, X=None, N=0):
            self.K = K
            if X is None:
                if N == 0:
                    raise Exception("If no data is provided, "
                                    "a parameter N (number of points) is needed")
                else:
                    self.N = N
                    self.X = self._init_board_gauss(N, K)
            else:
                self.X = X
                self.N = len(X)
            self.mu = None
            self.clusters = None
            self.method = None

        def _init_board_gauss(self, N, k):
            n = float(N)/k
            X = []
            for i in range(k):
                c = (random.uniform(-1,1), random.uniform(-1,1))
                s = random.uniform(0.05,0.15)
                x = []
                while len(x) < n:
                    a,b = np.array([np.random.normal(c[0],s),np.random.normal(c[1],s)])
                    # Continue drawing points from the distribution in the range [-1,1]
                    if abs(a) < 1 and abs(b) < 1:
                        x.append([a,b])
                X.extend(x)
            X = np.array(X)[:N]
            return X

        def plot_board(self):
            X = self.X
            fig = plt.figure(figsize=(5,5))
            plt.xlim(-1,1)
            plt.ylim(-1,1)
            if self.mu and self.clusters:
                mu = self.mu
                clus = self.clusters
                K = self.K
                for m, clu in clus.items():
                    # nipy_spectral stands in for cm.spectral, which newer
                    # matplotlib releases no longer provide
                    cs = cm.nipy_spectral(1.*m/self.K)
                    plt.plot(mu[m][0], mu[m][1], '*',
                             markersize=12, color=cs)
                    plt.plot(list(zip(*clus[m]))[0], list(zip(*clus[m]))[1], '.',
                             markersize=8, color=cs, alpha=0.5)
            else:
                plt.plot(list(zip(*X))[0], list(zip(*X))[1], '.', alpha=0.5)
            if self.method == '++':
                tit = 'K-means++'
            else:
                tit = 'K-means with random initialization'
            pars = 'N=%s, K=%s' % (str(self.N), str(self.K))
            plt.title('\n'.join([pars, tit]), fontsize=16)
            plt.savefig('kpp_N%s_K%s.png' % (str(self.N), str(self.K)),
                        bbox_inches='tight', dpi=200)

        def _cluster_points(self):
            mu = self.mu
            clusters = {}
            for x in self.X:
                # Key of the closest center
                bestmukey = min([(i[0], np.linalg.norm(x-mu[i[0]]))
                                 for i in enumerate(mu)], key=lambda t:t[1])[0]
                try:
                    clusters[bestmukey].append(x)
                except KeyError:
                    clusters[bestmukey] = [x]
            self.clusters = clusters

        def _reevaluate_centers(self):
            clusters = self.clusters
            newmu = []
            keys = sorted(self.clusters.keys())
            for k in keys:
                newmu.append(np.mean(clusters[k], axis = 0))
            self.mu = newmu

        def _has_converged(self):
            K = len(self.oldmu)
            return(set([tuple(a) for a in self.mu]) ==
                   set([tuple(a) for a in self.oldmu])
                   and len(set([tuple(a) for a in self.mu])) == K)

        def find_centers(self, method='random'):
            self.method = method
            X = self.X
            K = self.K
            # list(X) lets random.sample work on a numpy array of points
            self.oldmu = random.sample(list(X), K)
            if method != '++':
                # Initialize to K random centers
                self.mu = random.sample(list(X), K)
            while not self._has_converged():
                self.oldmu = self.mu
                # Assign all points in X to clusters
                self._cluster_points()
                # Reevaluate centers
                self._reevaluate_centers()

To initialize the board with n data points normally distributed around k centers, we call `kmeans = KMeans(k, N=n)`.

    kmeans = KMeans(3, N=200)
    kmeans.find_centers()
    kmeans.plot_board()

The snippet above creates a board with 200 points around 3 clusters. The call to the `find_centers()` function runs the standard k-means algorithm, initializing the centroids to 3 random points. Finally, the function `plot_board()` produces a plot of the data points as clustered by the algorithm, with the centroids marked as stars. In the image below we can see the results of running the algorithm twice. Due to the random initialization of the standard k-means, the correct solution is found some of the time (right panel), whereas in some cases a suboptimal end point is reached instead (left panel).

Let us now implement the k-means++ algorithm in its own class, which inherits from the `KMeans` class defined above.

    class KPlusPlus(KMeans):
        def _dist_from_centers(self):
            cent = self.mu
            X = self.X
            # Squared distance from each point to its closest chosen center
            D2 = np.array([min([np.linalg.norm(x-c)**2 for c in cent]) for x in X])
            self.D2 = D2

        def _choose_next_center(self):
            self.probs = self.D2/self.D2.sum()
            self.cumprobs = self.probs.cumsum()
            r = random.random()
            ind = np.where(self.cumprobs >= r)[0][0]
            return(self.X[ind])

        def init_centers(self):
            # list(...) lets random.sample work on a numpy array of points
            self.mu = random.sample(list(self.X), 1)
            while len(self.mu) < self.K:
                self._dist_from_centers()
                self.mu.append(self._choose_next_center())

        def plot_init_centers(self):
            X = self.X
            fig = plt.figure(figsize=(5,5))
            plt.xlim(-1,1)
            plt.ylim(-1,1)
            plt.plot(list(zip(*X))[0], list(zip(*X))[1], '.', alpha=0.5)
            plt.plot(list(zip(*self.mu))[0], list(zip(*self.mu))[1], 'ro')
            plt.savefig('kpp_init_N%s_K%s.png' % (str(self.N),str(self.K)),
                        bbox_inches='tight', dpi=200)

To run the k-means++ initialization stage using this class and visualize the centers found by the algorithm, we simply do:

    kplusplus = KPlusPlus(5, N=200)
    kplusplus.init_centers()
    kplusplus.plot_init_centers()

Let us explore what the function `init_centers()` is actually doing: to begin with, a random point is chosen as first center from the *X* data points with `random.sample`. Then, the successive centers are picked, stopping when we have *K*=5 of them. The procedure to choose the next most suitable center is coded up in the `_choose_next_center()` function. As we described above, the next center is drawn from a distribution given by the normalized distance vector $D_i^2 / \sum_j D_j^2$. To implement such a probability distribution, we compute the cumulative probabilities for choosing each of the *N* points in *X*. These cumulative probabilities partition the interval [0,1] into segments whose lengths equal the probability of the corresponding point being chosen as a center, as explained in this stackoverflow thread. Therefore, by picking a random value and finding the point corresponding to the segment of the partition where that value falls, we are effectively choosing a point drawn according to the desired probability distribution. On the right is a plot showing the results of the algorithm for 200 points and 5 clusters.
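The cumulative-probability trick can be isolated in a few lines. The sketch below is a standalone illustration with a hypothetical helper `sample_index` and made-up squared distances; it takes the random value `r` as an argument so that the mapping from `r` to the chosen index is easy to follow.

```python
import numpy as np

def sample_index(weights, r):
    """Index of the cumulative-probability segment that contains r in [0, 1)."""
    probs = np.asarray(weights, dtype=float)
    probs = probs / probs.sum()
    cumprobs = probs.cumsum()
    # First index whose cumulative probability is >= r
    ind = int(np.searchsorted(cumprobs, r, side='left'))
    # Guard against floating-point round-off in the last cumulative value
    return min(ind, len(probs) - 1)

# Made-up squared distances: point 0 is an existing center (distance 0)
D2 = [0.0, 1.0, 4.0, 5.0]          # probabilities 0, 0.1, 0.4, 0.5

print(sample_index(D2, 0.05))      # falls in the segment (0, 0.1]
print(sample_index(D2, 0.30))      # falls in the segment (0.1, 0.5]
print(sample_index(D2, 0.75))      # falls in the segment (0.5, 1.0]
```

In real use `r` would come from `random.random()`, so points far from all current centers own longer segments and are picked more often.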

Finally let us compare the results of k-means with random initialization and k-means++ with proper seeding, using the following code snippets:

    # Random initialization
    kplusplus.find_centers()
    kplusplus.plot_board()
    # k-means++ initialization
    kplusplus.find_centers(method='++')
    kplusplus.plot_board()

The standard algorithm with random initialization in a particular instantiation (left panel) fails at identifying the 5 optimal centroids for the clustering, whereas the k-means++ initialization (right panel) succeeds in doing so. By picking a carefully seeded rather than uniformly random set of centroids to initiate the clustering process, the k-means++ algorithm also tends to reach convergence faster in practice; the theorems proved in the Arthur and Vassilvitskii article guarantee that the expected clustering cost is within a factor O(log k) of the optimum.

The k-means++ method for finding a proper seeding for the choice of initial centroids yields considerable improvement over the standard Lloyd’s implementation of the k-means algorithm. The initial selection in k-means++ takes extra time and involves choosing centers in a successive order and drawing them from a particular probability distribution that has to be recomputed at each step. However, by doing so, the k-means part of the algorithm typically converges very quickly after this seeding and thus the whole procedure usually runs in a shorter computation time. The combination of the k-means++ initialization stage with the standard Lloyd’s algorithm, together with additional techniques to find an optimal value for the number of clusters, provides a robust way to solve the complete problem of clustering data points.

In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering or cluster analysis, and involves grouping data into categories based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional vector space). (Wikipedia)

At The Data Science Lab, we have already reviewed some basics of *unsupervised* classification with the Lloyd algorithm for k-means clustering and have investigated how to find the appropriate number of clusters. Today’s post will be devoted to a classical machine learning algorithm for *supervised* classification: the perceptron learning algorithm.

The perceptron learning algorithm was invented in the late 1950s by Frank Rosenblatt. It belongs to the class of linear classifiers; that is, for a data set classified according to binary categories (which we will assume to be labeled +1 and -1), the classifier seeks to divide the two classes by a linear separator. The separator is an *(n-1)*-dimensional hyperplane in an *n*-dimensional space; in particular, it is a line in the plane and a plane in 3-dimensional space.

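Our data set will be assumed to consist of *N* observations characterized by *d* features or attributes, $\mathbf{x}_n = (x_1, \ldots, x_d)$ for $n = 1, \ldots, N$. The problem of binary classifying these data points can be translated to that of finding a series of weights $w_i$ such that all vectors verifying

$$\sum_{i=1}^{d} w_i x_i < b$$

are assigned to one of the classes whereas those verifying

$$\sum_{i=1}^{d} w_i x_i > b$$

are assigned to the other, for a given threshold value $b$. If we rename $w_0 = -b$ and introduce an artificial coordinate $x_0 = 1$ in our vectors $\mathbf{x}_n$, we can write the perceptron separator formula as

$$h(\mathbf{x}) = \mathrm{sign}\left(\sum_{i=0}^{d} w_i x_i\right) = \mathrm{sign}(\mathbf{w}^T \mathbf{x}).$$

Note that $\mathbf{w}^T \mathbf{x}$ is notation for the scalar product between vectors $\mathbf{w}$ and $\mathbf{x}$. Thus the problem of classifying is that of finding the vector of weights $\mathbf{w}$ given a training data set of *N* vectors $\mathbf{x}_n$ with their corresponding labeled classification vector $(y_1, \ldots, y_N)$.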

The learning algorithm for the perceptron is online, meaning that instead of considering the entire data set at the same time, it only looks at one example at a time, processes it and goes on to the next one. The algorithm starts with a guess for the vector $\mathbf{w}$ (without loss of generality one can begin with a vector of zeros). It then assesses how good a guess that is by comparing the predicted labels with the actual, correct labels (remember that those are available for the training set, since we are doing supervised learning). As long as there are misclassified points, the algorithm corrects its guess for the weight vector by updating the weights in the correct direction, until all points are correctly classified.

That direction is as follows: given a labeled training data set, if $\mathbf{w}$ is the guessed weight vector and $\mathbf{x}_n$ is an incorrectly classified point with $\mathrm{sign}(\mathbf{w}^T \mathbf{x}_n) \neq y_n$, then the weight is updated to $\mathbf{w} + y_n \mathbf{x}_n$. This is illustrated in the plot on the right, taken from this clear article on the perceptron.
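A single update step can be checked by hand. The numbers below are assumed values chosen for illustration: a current weight guess, one point (with the artificial coordinate 1 in front) and its true label. Adding the label times the point to the weights raises the signed score of that point by its squared norm, which pushes it toward the correct side; one update happens to be enough here, although in general several may be needed.

```python
import numpy as np

w = np.array([0.0, 1.0, -1.0])   # current guess for the weights (assumed)
x = np.array([1.0, 0.5, 2.0])    # point with artificial coordinate x_0 = 1
y = 1                            # true label of the point

before = int(np.sign(w.dot(x)))  # w.x = 0.5 - 2.0 = -1.5, so the point is misclassified
w_new = w + y * x                # perceptron update
after = int(np.sign(w_new.dot(x)))  # w_new.x = -1.5 + |x|^2 = -1.5 + 5.25 = 3.75

print(before, w_new, after)
```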

A nice feature of the perceptron learning rule is that if there exists a set of weights that solves the problem (i.e. if the data is linearly separable), then the perceptron will find such a set of weights.

For our Python implementation we will use a trivial example in two dimensions: within the $[-1,1] \times [-1,1]$ space, we define two random points and draw the line that joins them. The general equation of a line given two points in it, $(x_1, y_1)$ and $(x_2, y_2)$, is $A + Bx + Cy = 0$, where $A$, $B$ and $C$ can be written in terms of the two points. Defining a vector $\mathbf{V} = (A, B, C)$, any point $(x, y)$ belongs to the line if $\mathbf{V}^T \mathbf{x} = 0$, where $\mathbf{x} = (1, x, y)$. Points for which the dot product is positive fall on one side of the line, while those for which it is negative fall on the other.
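As a quick check of this construction, the sketch below builds the line vector from two points using the same coefficient expressions that appear in the `Perceptron` class further down, and then evaluates the sign of the dot product for a few test points. The helper names `line_vector` and `side` and the chosen points are assumptions made for this illustration.

```python
import numpy as np

def line_vector(p, q):
    """Vector V = (A, B, C) such that A + B*x + C*y = 0 on the line through p and q."""
    (xA, yA), (xB, yB) = p, q
    return np.array([xB*yA - xA*yB, yB - yA, xA - xB])

def side(V, point):
    """+1 or -1 depending on the side of the line, 0 if the point lies on it."""
    x = np.array([1.0, point[0], point[1]])
    return int(np.sign(V.dot(x)))

V = line_vector((0.0, 0.0), (1.0, 1.0))   # the diagonal y = x

print(V)                      # coefficients A, B, C
print(side(V, (0.5, 0.5)))    # a point on the line
print(side(V, (1.0, 0.0)))    # a point below the diagonal
print(side(V, (0.0, 1.0)))    # a point above the diagonal
```

For the diagonal the vector is (0, 1, -1), so the dot product reduces to x - y: zero on the line, positive below it and negative above it.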

This procedure automatically divides the plane linearly in two regions. We randomly choose *N* points in this space and classify them as +1 or -1 according to the dividing line defined before. The `Perceptron` class defined below is initialized exactly in this way. The perceptron learning algorithm is implemented in the `pla` function, and the classification error, defined as the fraction of misclassified points, is in `classification_error`. The code is as follows:

    import numpy as np
    import random
    import os, subprocess
    import matplotlib.pyplot as plt

    class Perceptron:
        def __init__(self, N):
            # Random linearly separated data
            xA,yA,xB,yB = [random.uniform(-1, 1) for i in range(4)]
            self.V = np.array([xB*yA-xA*yB, yB-yA, xA-xB])
            self.X = self.generate_points(N)

        def generate_points(self, N):
            X = []
            for i in range(N):
                x1,x2 = [random.uniform(-1, 1) for i in range(2)]
                x = np.array([1,x1,x2])
                # Label according to the side of the target line
                s = int(np.sign(self.V.T.dot(x)))
                X.append((x, s))
            return X

        def plot(self, mispts=None, vec=None, save=False):
            fig = plt.figure(figsize=(5,5))
            plt.xlim(-1,1)
            plt.ylim(-1,1)
            V = self.V
            a, b = -V[1]/V[2], -V[0]/V[2]
            l = np.linspace(-1,1)
            plt.plot(l, a*l+b, 'k-')
            cols = {1: 'r', -1: 'b'}
            for x,s in self.X:
                plt.plot(x[1], x[2], cols[s]+'o')
            if mispts:
                for x,s in mispts:
                    plt.plot(x[1], x[2], cols[s]+'.')
            if vec is not None:
                aa, bb = -vec[1]/vec[2], -vec[0]/vec[2]
                plt.plot(l, aa*l+bb, 'g-', lw=2)
            if save:
                if not mispts:
                    plt.title('N = %s' % (str(len(self.X))))
                else:
                    plt.title('N = %s with %s test points'
                              % (str(len(self.X)),str(len(mispts))))
                plt.savefig('p_N%s' % (str(len(self.X))),
                            dpi=200, bbox_inches='tight')

        def classification_error(self, vec, pts=None):
            # Error defined as fraction of misclassified points
            if not pts:
                pts = self.X
            M = len(pts)
            n_mispts = 0
            for x,s in pts:
                if int(np.sign(vec.T.dot(x))) != s:
                    n_mispts += 1
            error = n_mispts / float(M)
            return error

        def choose_miscl_point(self, vec):
            # Choose a random point among the misclassified
            pts = self.X
            mispts = []
            for x,s in pts:
                if int(np.sign(vec.T.dot(x))) != s:
                    mispts.append((x, s))
            return mispts[random.randrange(0,len(mispts))]

        def pla(self, save=False):
            # Initialize the weights to zeros
            w = np.zeros(3)
            X, N = self.X, len(self.X)
            it = 0
            # Iterate until all points are correctly classified
            while self.classification_error(w) != 0:
                it += 1
                # Pick random misclassified point
                x, s = self.choose_miscl_point(w)
                # Update weights
                w += s*x
                if save:
                    self.plot(vec=w)
                    plt.title('N = %s, Iteration %s\n'
                              % (str(N),str(it)))
                    plt.savefig('p_N%s_it%s' % (str(N),str(it)),
                                dpi=200, bbox_inches='tight')
            self.w = w

        def check_error(self, M, vec):
            check_pts = self.generate_points(M)
            return self.classification_error(vec, pts=check_pts)

To start a run of the perceptron with 20 data points and visualize the initial configuration, we simply initialize the `Perceptron` class and call the `plot` function:

    p = Perceptron(20)
    p.plot()

On the right is the plane that we obtain, divided in two by the black line. Red points are labeled as +1 while blue ones are -1. The purpose of the perceptron learning algorithm is to “learn” a linear classifier that correctly separates red from blue points given the labeled set of 20 points shown in the figure. That is, we want to learn the black line as faithfully as possible.

The call to `p.pla()` runs the algorithm and stores the final weights learned in `p.w`. To save a plot of each of the iterations, we can use the option `p.pla(save=True)`. The following snippet will concatenate all images together to produce an animated gif of the running algorithm:

    import os, subprocess

    basedir = '/my/output/directory/'
    os.chdir(basedir)
    pngs = [pl for pl in os.listdir(basedir) if pl.endswith('png')]
    # Sort the frames by iteration number (file names end in 'it<n>.png')
    sortpngs = sorted(pngs, key=lambda a:int(a.split('it')[1][:-4]))
    basepng = pngs[0][:-8]
    # Repeat the last frame so that the animation pauses at the end
    sortpngs.extend([sortpngs[-1]] * 4)
    # Calls ImageMagick's convert command, which must be installed
    comm = ("convert -delay 50 %s %s.gif" % (' '.join(sortpngs), basepng)).split()
    proc = subprocess.Popen(comm, stdin = subprocess.PIPE, stdout = subprocess.PIPE)
    (out, err) = proc.communicate()

Below we can see how the algorithm tries successive values for the weight vector, leading to a succession of guesses for the linear separator, which are plotted in green. For as long as the green line misclassifies points (meaning that it does not separate the blue from the red points correctly), the perceptron keeps updating the weights in the manner described above. Eventually all points are on the correct side of the guessed line, the classification error in the training data set ($E_{in}$ for the in-sample points) is thus zero, and the algorithm converges and stops.

Clearly the final guessed green line after the 7 iterations does separate the training data points correctly, but it does not completely agree with the target black line. An error in classifying not-seen data points is bound to exist ($E_{out}$ for out-of-sample points), and we can quantify it easily by evaluating the performance of the linear classifier on fresh, unseen data points:

p.plot(p.generate_points(500), p.w, save=True)

In this image we can observe how, even if $E_{in} = 0$ for the training points represented by the large dots, $E_{out} \neq 0$, as shown by the small red and blue dots that fall on one side of the black (target) line but are incorrectly classified by the green (guessed) one. The exact out-of-sample error is given by the area between both lines, and can thus be computed analytically. But it can also be estimated with repeated Monte Carlo sampling:

    err = []
    for i in range(100):
        err.append(p.check_error(500, p.w))
    np.mean(err)

`0.0598200`

The perceptron algorithm has thus learned a linear binary classifier that correctly classifies data in 94% of the cases, having an out-of-sample error rate of 6%.

The perceptron learning algorithm is a classical example of a binary linear supervised classifier. Its implementation involves finding a linear boundary that completely separates points belonging to the two classes. If the data is linearly separable, then the procedure will converge to a weight vector that separates the data. (And if the data is inseparable, the PLA will never converge.) The perceptron is thus fundamentally limited in that its decision boundaries can only be linear. Eventually we will explore other methods that overcome this limitation, either by combining multiple perceptrons in a single framework (neural networks) or by mapping features in an efficient way (kernels).
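The non-convergence on inseparable data is easy to observe with a capped variant of the learning loop. The sketch below is an assumption-laden illustration and not the `pla` method above: the function `pla_capped` and its `max_iter` parameter are hypothetical, and it always updates on the first misclassified point instead of a random one. It is run on the AND labels (linearly separable) and the XOR labels (not linearly separable) of the four corners of the unit square.

```python
import numpy as np

def pla_capped(X, y, max_iter=1000):
    """Perceptron learning with an iteration cap.
    Returns (weights, number of updates, converged flag)."""
    w = np.zeros(X.shape[1])
    for it in range(max_iter):
        # Indices of the currently misclassified points
        mis = [i for i in range(len(y)) if np.sign(w.dot(X[i])) != y[i]]
        if not mis:
            return w, it, True
        i = mis[0]                 # deterministic choice, for reproducibility
        w = w + y[i] * X[i]
    return w, max_iter, False

# Corners of the unit square, with the artificial coordinate 1 in front
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y_and = np.array([-1, -1, -1, 1])   # separable labels
y_xor = np.array([-1, 1, 1, -1])    # inseparable labels

w_and, n_and, ok_and = pla_capped(X, y_and)
w_xor, n_xor, ok_xor = pla_capped(X, y_xor)
print(ok_and, ok_xor)
```

With the AND labels the loop stops once every point is classified correctly, while with the XOR labels it always exhausts the iteration budget, since no line separates the two classes.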

List comprehension in Python is extremely flexible and powerful. Let us practice some more with further neat examples of it:

    [x**2 for x in range(10)]

`[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]`

    [x for x in range(100) if x%5 == 0]

`[0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]`

    [x for x in range(50) if x%3 == 0 and x%6 != 0]

`[3, 9, 15, 21, 27, 33, 39, 45]`

    import string
    punct = string.punctuation + ' '
    vowels = "aeiou"
    phrase = "On second thought, let's not go to Camelot. It is a silly place."
    set([c for c in phrase.lower() if c not in vowels and c not in punct])

`{'c', 'd', 'g', 'h', 'l', 'm', 'n', 'p', 's', 't', 'y'}`

    [w[0] for w in phrase.split()]

`['O', 's', 't', 'l', 'n', 'g', 't', 'C', 'I', 'i', 'a', 's', 'p']`

    "".join([c if c not in vowels else '0' for c in phrase])

`"On s0c0nd th00ght, l0t's n0t g0 t0 C0m0l0t. It 0s 0 s0lly pl0c0."`

    words1 = ['Lancelot', 'Robin', 'Galahad']
    words2 = ['Camelot', 'Assyria']
    [(w1,w2) for w1 in words1 for w2 in words2]

`[('Lancelot', 'Camelot'), ('Lancelot', 'Assyria'), ('Robin', 'Camelot'), ('Robin', 'Assyria'), ('Galahad', 'Camelot'), ('Galahad', 'Assyria')]`

I will update this list as more interesting and useful examples come to mind. What’s your favorite use of list comprehension and how many lines of code did it save you?

]]>As per Wikipedia, the universe of the Game of Life is an infinite two-dimensional grid of square cells, each of which is in one of two possible states, alive or dead. Every cell interacts with its eight neighbors, which are the cells that are horizontally, vertically, or diagonally adjacent. At each step in time, the following transitions occur:

- Any live cell with fewer than two live neighbors dies, as if caused by under-population.
- Any live cell with two or three live neighbors lives on to the next generation.
- Any live cell with more than three live neighbors dies, as if by overcrowding.
- Any dead cell with exactly three live neighbors becomes a live cell, as if by reproduction.
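The four transitions above collapse into a very small decision function. Here is a minimal sketch for a single cell; the function name `next_state` is an assumption made for this illustration and is not part of the implementation that follows:

```python
def next_state(alive, live_neighbors):
    """Return True if the cell is alive in the next generation."""
    if alive:
        # Survival: exactly two or three live neighbors
        return live_neighbors in (2, 3)
    # Reproduction: a dead cell with exactly three live neighbors
    return live_neighbors == 3

# A live cell with one neighbor dies; a dead cell with three is born
print(next_state(True, 1), next_state(False, 3))
```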

Our Python implementation will use a two-dimensional numpy array to store the grid representing the universe, with values 1 for live and 0 for dead cells. We will code an `init_universe` function for the initialization of the grid and an `evolve` routine for the evolution of the universe. A very simple way of initializing the grid to random values, allowing for variable grid dimensions, is as follows:

    import random
    import numpy as np

    def init_universe(rows, cols):
        grid = np.zeros([rows, cols])
        for i in range(rows):
            for j in range(cols):
                # Each cell is alive (1) or dead (0) with equal probability
                grid[i][j] = round(random.random())
        return grid

An example of a random universe created with the `init_universe(rows, cols)` function with 600 cells distributed in 20 rows and 30 columns can be seen in the figure on the right. The call to generate the figure, with black cells representing live (or 1) states, is as follows:

    import matplotlib.pyplot as plt
    from matplotlib import cm

    grid = init_universe(20, 30)
    ax = plt.axes()
    ax.matshow(grid, cmap=cm.binary)
    ax.set_axis_off()

Now, for the evolution logic, let us code a function that takes a universe as input, together with the parameters that regulate its evolution, and outputs the new universe after one iteration. The classical rules of the Game of Life set the parameters for overcrowding, under-population and reproduction as 3, 2, 3, respectively. In our implementation, we create a padding around the original universe, which allows us to define the neighbors in an easy way without having to worry whether a particular cell is at the border or not. At every position *i,j* we compute the sum of all cells in the positions *(i-1, i, i+1) x (j-1, j, j+1)* and then we subtract the center point at *i,j*. Then we apply the evolution logic: cells die when underpopulated or overcrowded, and new cells are born when the reproduction condition (3 live neighbors) is fulfilled:

    def evolve(grid, pars):
        overcrowd, underpop, reproduction = pars
        rows, cols = grid.shape
        newgrid = np.zeros([rows, cols])
        neighbors = np.zeros([rows, cols])
        # Auxiliary padded grid
        padboard = np.zeros([rows+2, cols+2])
        padboard[:-2, :-2] = grid
        # Compute neighbours and newgrid
        for i in range(rows):
            for j in range(cols):
                neighbors[i][j] += sum([padboard[a][b] for a in [i-1, i, i+1]
                                        for b in [j-1, j, j+1]])
                neighbors[i][j] -= padboard[i][j]
                # Evolution logic
                newgrid[i][j] = grid[i][j]
                if grid[i][j] and \
                   (neighbors[i][j] > overcrowd or neighbors[i][j] < underpop):
                    newgrid[i][j] = 0
                elif not grid[i][j] and neighbors[i][j] == reproduction:
                    newgrid[i][j] = 1
        return newgrid

Note that in the above code we make use of a wonderful property of arrays in Python, namely that the last element of an array `arr` can be referenced either as `arr[len(arr)-1]` or as `arr[-1]`. Thus, we create a padboard with 2 columns and 2 rows more than the dimensions of the grid. If n is the number of rows of the grid, the padboard has n+2 rows, which range from 0 to n+1, *or*, equivalently, from -1 to n!
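To see the negative-index trick at work on something small, here is a minimal sketch that uses plain Python lists instead of numpy arrays. The 2x2 grid and the helper `count_neighbors` are made up for this illustration; the padding layout mirrors the one described above, with the grid in the top-left corner and the two extra rows and columns at the end.

```python
# Hypothetical 2x2 universe in which every cell is alive
grid = [[1, 1],
        [1, 1]]
rows, cols = 2, 2

# Padded board: grid in the top-left corner, zeros in the last 2 rows/columns
padboard = [[0] * (cols + 2) for _ in range(rows + 2)]
for i in range(rows):
    for j in range(cols):
        padboard[i][j] = grid[i][j]

def count_neighbors(i, j):
    """Sum the 3x3 block around (i, j) and subtract the center cell."""
    total = sum(padboard[a][b]
                for a in (i - 1, i, i + 1)
                for b in (j - 1, j, j + 1))
    return total - padboard[i][j]

# Index -1 is the last (padding) row, so border cells need no special case
print(padboard[-1] is padboard[rows + 1])
print(count_neighbors(0, 0))
```

For the corner cell at position 0,0 the indices -1 silently land on the zero padding, so only the three real neighbors are counted.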

For the visualization of the evolution of our random universe we could create a series of png plots and stitch them together to produce an animated gif. However, matplotlib also offers the possibility of generating animations and saving them directly in mp4 format. The code that follows is based on this very useful tutorial, which contains instructions to embed matplotlib animations directly in the ipython notebook.

    pars = 3, 2, 3
    rows, cols = 20, 20
    fig = plt.figure()
    ax = plt.axes()
    im = ax.matshow(init_universe(rows, cols), cmap=cm.binary)
    ax.set_axis_off()

    def init():
        im.set_data(init_universe(rows, cols))
        # With blit=True, init_func must return the artists to redraw
        return [im]

    def animate(i):
        a = im.get_array()
        a = evolve(a, pars)
        im.set_array(a)
        return [im]

In the code above, we have set the parameters for the evolution as described by the original logic of the game, and we initialize a matplotlib figure using the `matshow` directive. We also need the functions `init` and `animate`; the latter updates the content of the plot with evolved iterations of the universe. The matplotlib call to produce an animation and save it in mp4 format is then simply:

    from matplotlib import animation

    anim = animation.FuncAnimation(fig, animate, init_func=init,
                                   frames=100, blit=True)
    anim.save('animation_random.mp4', fps=10)  # fps = frames per second

A random initialization gives rise to the following animation, which ends in a configuration with 3 stable patterns (block, blinker and a diamond-shaped structure that settles to a blinker in 8 steps) after approximately 50 iterations.

While a random initialization often gives rise to interesting universes, there is a vast body of research devoted to classifying particular configurations that are known to evolve in a specific fashion (oscillators, stable figures, moving patterns…). A quick google search illustrates this point and leads to many resources for the interested reader. For starters, let us code up the initialization function of a “pulsar”, a type of oscillator with a 3-iteration period.

    def init_universe_pulsar():
        grid = np.zeros([15, 15])
        line = np.zeros(15)
        line[3:6] = 1
        line[9:12] = 1
        for ind in [1, 6, 8, 13]:
            grid[ind] = line
            grid[:, ind] = line
        return grid
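As a quick sanity check of the 3-iteration period, here is a minimal, self-contained sketch that does not rely on the `evolve` routine. It swaps in a different technique: the universe is stored as a set of live `(row, col)` coordinates on an unbounded grid, and the standard rules are applied through a neighbor counter. The helper name `step` is an assumption made for this illustration.

```python
from collections import Counter

def step(live):
    """Advance a set of live (row, col) cells by one generation."""
    # Count, for every cell, how many live neighbors it has
    counts = Counter((r + dr, c + dc)
                     for r, c in live
                     for dr in (-1, 0, 1)
                     for dc in (-1, 0, 1)
                     if (dr, dc) != (0, 0))
    # Born with exactly 3 neighbors, survive with 2 or 3
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

# Same live cells as in the pulsar initialization: rows and columns
# 1, 6, 8 and 13, each carrying live cells at positions 3-5 and 9-11
offsets = [3, 4, 5, 9, 10, 11]
pulsar = set()
for ind in (1, 6, 8, 13):
    pulsar |= {(ind, o) for o in offsets}
    pulsar |= {(o, ind) for o in offsets}

state = pulsar
for _ in range(3):
    state = step(state)

# True if the pattern returns to its initial configuration after 3 steps
print(state == pulsar)
```

The set-based representation trades the fixed grid for an unbounded one, which is convenient for checks like this but less direct to plot with `matshow`.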

To generate and save this universe, we need to modify the function `init` used to produce the matplotlib animation and replace the call to `init_universe(rows, cols)` with `init_universe_pulsar()`. The resulting evolution can be seen in the following video:

An interesting kind of universe is one that resembles a spaceship. There are many of them, as a visit to this page shows. Other configurations resemble guns that emit gliders forever. By far the most famous one is the Gosper glider gun, which can be generated using the following initialization function:

    def init_universe_glider_gun():
        # One 38-character string per row; '1' marks a live cell
        glider_gun = (
            38*'0' +
            25*'0'+'1'+12*'0' +
            23*'0'+'101'+12*'0' +
            13*'0'+'11'+6*'0'+'11'+12*'0'+'11'+'0' +
            12*'0'+'1'+3*'0'+'1'+4*'0'+'11'+12*'0'+'11'+'0' +
            '0'+'11'+8*'0'+'1'+5*'0'+'100011'+15*'0' +
            '0'+'11'+8*'0'+'1'+'000'+'1011'+4*'0'+'101'+12*'0' +
            11*'0'+'1000001'+7*'0'+'1'+12*'0' +
            12*'0'+'10001'+21*'0' +
            13*'0'+'11'+23*'0' +
            38*'0' +
            19*38*'0'
        )
        grid = np.array([float(g) for g in glider_gun]).reshape(30, 38)
        return grid

Once started, this glider gun emits gliders indefinitely; they move across the grid at -45 degrees and exit the universe bounding box through the bottom right corner. Bill Gosper discovered this first glider gun, which is so far the smallest one ever found, in 1970 and got 50 dollars from Conway for it. The discovery of the glider gun eventually led to the proof that Conway’s Game of Life could function as a Turing machine. A video of the Gosper glider gun in action can be seen below.

Here is a very nice visualization of yet another type of configuration in Life, the so-called “puffers”, patterns that move like a spaceship but leave debris behind as they evolve.

Conway’s Game of Life is a zero-player game, with evolution completely determined by its initial state, consisting of live and dead cells on a two-dimensional grid. The state of each cell varies with each iteration according to the number of populated neighbors in the adjacent cells. The game, devised in 1970, sparked wide interest in the field of cellular automata and belongs to a growing class of what are called “simulation games”. Implementing the evolution algorithm behind the game in any programming language is a classical exercise in many CS schools, is always a lot of fun, and allows one to explore the huge variety of configurations that give rise to strangely addictive evolution patterns.

]]>Clustering consists of grouping objects in sets, such that objects within a cluster are as similar as possible, whereas objects from different clusters are as dissimilar as possible. Thus, the optimal clustering is somewhat subjective and dependent on the characteristic used for determining similarities, as well as on the level of detail required from the partitions. For the purpose of our clustering experiment we use clusters derived from Gaussian distributions, i.e. globular in nature, and look only at the usual definition of Euclidean distance between points in a two-dimensional space to determine intra- and inter-cluster similarity.

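The following measure represents the sum of intra-cluster distances between points in a given cluster $C_r$ containing $n_r$ points, where $\mu_r$ is the centroid of the cluster:

$$D_r = \sum_{x_i \in C_r} \sum_{x_j \in C_r} \|x_i - x_j\|^2 = 2 n_r \sum_{x_i \in C_r} \|x_i - \mu_r\|^2.$$

Adding the normalized intra-cluster sums of squares gives a measure of the compactness of our clustering into $k$ clusters:

$$W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r.$$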
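The dispersion of a cluster can be written in two equivalent ways: the sum of squared distances over all ordered pairs of points equals twice the number of points times the sum of squared distances to the centroid. Here is a minimal numeric check of that identity on a hypothetical cluster of four points, chosen only for illustration:

```python
# Hypothetical cluster: the four corners of a square of side 2
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
n = len(cluster)

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Centroid: coordinate-wise mean of the points
centroid = tuple(sum(coord) / n for coord in zip(*cluster))

# Sum over all ordered pairs (x_i, x_j) in the cluster
pairwise = sum(sq_dist(p, q) for p in cluster for q in cluster)
# Sum of squared distances to the centroid
to_centroid = sum(sq_dist(p, centroid) for p in cluster)

print(pairwise, 2 * n * to_centroid)
```

Dividing the pairwise sum by twice the number of points therefore leaves just the sum of squared distances to the centroid, which is the contribution of this cluster to the compactness measure.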

This variance quantity $W_k$ is the basis of a naive procedure to determine the optimal number of clusters: the elbow method.

If you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the “elbow criterion”.

But as Wikipedia promptly explains, this “elbow” cannot always be unambiguously identified. In this post we will show a more sophisticated method that provides a statistical procedure to formalize the “elbow” heuristic.

The gap statistic was developed by Stanford researchers Tibshirani, Walther and Hastie in their 2001 paper. The idea behind their approach was to find a way to standardize the comparison of $\log W_k$ with a null reference distribution of the data, i.e. a distribution with no obvious clustering. Their estimate for the optimal number of clusters $K$ is the value for which $\log W_k$ falls the farthest below this reference curve. This information is contained in the following formula for the gap statistic:

$$\mathrm{Gap}_n(k) = E_n^*\{\log W_k\} - \log W_k.$$

The reference datasets are in our case generated by sampling uniformly from the original dataset’s bounding box (see green box in the upper right plot of the figures below). To obtain the estimate $E_n^*\{\log W_k\}$ we compute the average of $B$ copies $\log W_k^*$ for $B = 10$, each of which is generated with a Monte Carlo sample from the reference distribution. Those $\log W_k^*$ from the $B$ Monte Carlo replicates exhibit a standard deviation $\mathrm{sd}(k)$ which, accounting for the simulation error, is turned into the quantity

$$s_k = \sqrt{1 + 1/B}\,\mathrm{sd}(k).$$

Finally, the optimal number of clusters $K$ is the smallest $k$ such that $\mathrm{Gap}(k) \geq \mathrm{Gap}(k+1) - s_{k+1}$.

The computation of the gap statistic involves the following steps (see original paper):

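- Cluster the observed data, varying the number of clusters from $k = 1, \ldots, k_{\max}$, and compute the corresponding $W_k$.
- Generate $B$ reference data sets and cluster each of them with varying number of clusters $k = 1, \ldots, k_{\max}$. Compute the estimated gap statistic $\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log W_{kb}^* - \log W_k$.
- With $\bar{w} = \frac{1}{B} \sum_{b} \log W_{kb}^*$, compute the standard deviation $\mathrm{sd}(k) = \left[ \frac{1}{B} \sum_{b} (\log W_{kb}^* - \bar{w})^2 \right]^{1/2}$ and define $s_k = \sqrt{1 + 1/B}\,\mathrm{sd}(k)$.
- Choose the number of clusters as the smallest $k$ such that $\mathrm{Gap}(k) \geq \mathrm{Gap}(k+1) - s_{k+1}$.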
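The last step of the procedure, the selection rule, can be sketched in a few lines. The function name `choose_k` and the numbers below are made up purely for illustration; the inputs are assumed to be sequences of equal length holding, for each candidate number of clusters, the log-dispersion of the data, the averaged log-dispersion of the reference sets, and the corrected standard deviation.

```python
def choose_k(ks, log_wks, log_wkbs, sk):
    """Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}, or None if no k qualifies."""
    # Gap(k) = averaged reference log-dispersion minus observed log-dispersion
    gaps = [ref - obs for ref, obs in zip(log_wkbs, log_wks)]
    for i in range(len(ks) - 1):
        if gaps[i] >= gaps[i + 1] - sk[i + 1]:
            return ks[i]
    return None

# Made-up values, not taken from any real run
ks = [1, 2, 3, 4]
log_wks = [5.0, 4.0, 2.5, 2.4]    # observed data
log_wkbs = [5.2, 4.6, 4.1, 3.8]   # average over reference data sets
sk = [0.05, 0.05, 0.05, 0.05]

print(choose_k(ks, log_wks, log_wkbs, sk))
```

Returning `None` when the criterion is never met within the scanned range is a design choice of this sketch; in that case one would extend the range of candidate values.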

Our Python implementation makes use of the `find_centers(X, K)` function defined in this post. The quantity $W_k$ is computed as follows:

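    def Wk(mu, clusters):
        K = len(mu)
        # D_r/(2*n_r) reduces to the sum of squared distances to the centroid,
        # so no further division by the cluster size is needed. (Dividing by
        # 2*len(c), where c is a single point, would only rescale W_k by a
        # constant that cancels in the gap.)
        return sum([np.linalg.norm(mu[i]-c)**2
                    for i in range(K) for c in clusters[i]])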

The gap statistic is implemented in the following code snippet. Note that we use $B = 10$ for the reference datasets and we span values of $K$ from 1 to 9.

    def bounding_box(X):
        xmin, xmax = min(X, key=lambda a: a[0])[0], max(X, key=lambda a: a[0])[0]
        ymin, ymax = min(X, key=lambda a: a[1])[1], max(X, key=lambda a: a[1])[1]
        return (xmin, xmax), (ymin, ymax)

    def gap_statistic(X):
        (xmin, xmax), (ymin, ymax) = bounding_box(X)
        # Dispersion for real distribution
        ks = range(1, 10)
        Wks = np.zeros(len(ks))
        Wkbs = np.zeros(len(ks))
        sk = np.zeros(len(ks))
        for indk, k in enumerate(ks):
            mu, clusters = find_centers(X, k)
            Wks[indk] = np.log(Wk(mu, clusters))
            # Create B reference datasets
            B = 10
            BWkbs = np.zeros(B)
            for i in range(B):
                Xb = []
                for n in range(len(X)):
                    Xb.append([random.uniform(xmin, xmax),
                               random.uniform(ymin, ymax)])
                Xb = np.array(Xb)
                mu, clusters = find_centers(Xb, k)
                BWkbs[i] = np.log(Wk(mu, clusters))
            Wkbs[indk] = sum(BWkbs)/B
            sk[indk] = np.sqrt(sum((BWkbs-Wkbs[indk])**2)/B)
        # 1.0/B avoids integer division under Python 2
        sk = sk*np.sqrt(1 + 1.0/B)
        return (ks, Wks, Wkbs, sk)
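The generation of one reference data set can be isolated into a small helper, which makes it easy to check that the Monte Carlo sample really stays inside the bounding box. This is a minimal sketch with the standard library only; the function name `sample_reference`, the seed and the three points are assumptions made for the illustration.

```python
import random

def sample_reference(points, rng=random):
    """Draw len(points) samples uniformly from the bounding box of points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    xmin, xmax = min(xs), max(xs)
    ymin, ymax = min(ys), max(ys)
    return [(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
            for _ in points]

# Hypothetical data set with bounding box [-1, 2] x [0, 3]
data = [(-1.0, 0.5), (2.0, 0.0), (0.3, 3.0)]
reference = sample_reference(data, rng=random.Random(0))

print(len(reference))
```

Passing a seeded `random.Random` instance keeps the sample reproducible, which helps when comparing gap curves between runs.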

We shall now apply our algorithm to diverse distributions and see how it performs. Using the `init_board_gauss(N, k)` function defined in our previous post, we produce an ensemble of 200 data points normally distributed around 3 centers and run the gap statistic on them.

    X = init_board_gauss(200, 3)
    ks, logWks, logWkbs, sk = gap_statistic(X)

The following plot gives an idea of what is happening:

The upper left plot shows the target distribution with 3 clusters. On the right is its bounding box and one Monte Carlo sample drawn from a uniform reference distribution within that rectangle. In the middle left we see the plot of $W_k$ that is used to determine $K$ with the elbow method. Indeed a knee-like feature is observed at $K = 3$; however, the gap statistic is a better way of formalizing this phenomenon. On the right is the comparison of $\log W_k$ for the original and averaged reference distributions. Finally, the bottom plots show the gap quantity on the left, with a clear peak at the correct $K = 3$, and the criterion for choosing it on the right. The correct $K$ is the smallest for which the quantity plotted in blue bars becomes positive. The optimal number is correctly guessed by the algorithm as $K = 3$.

Let us now have a look at another example with 400 points around 5 clusters:

In this case, the elbow method would not have been conclusive; however, the gap statistic correctly shows a peak in the gap at $K = 5$ and the bar plot changes sign at the same correct value.

Similarly, we can study what happens when the data points are clustered around a single centroid:

It is clear in the above figures that the original and the reference distributions in the middle right plot follow the same decay law, so that no abrupt fall-off of the blue curve with respect to the red one is observed at any $K$. The bar plot shows positive values for the entire range. We conclude that $K = 1$ is the correct clustering.

Finally, let us have a look at a uniform, non-clustered distribution of 200 points, generated with the `init_board(N)` function defined in our previous post:

In this case, the algorithm also guesses correctly, and it is clear from the middle right plot that both the original and the reference distributions follow exactly the same decay law, since they are essentially different samples from the same uniform distribution on [-1,1] x [-1,1]. The gap curve on the bottom left oscillates between local maxima and minima, indicating certain structures within the original distribution originated by statistical fluctuations.

The estimation of the optimal number of clusters within a set of data points is a very important problem, as most clustering algorithms need that parameter as input in order to group the data. Many methods have been proposed to find the proper $K$, among which the “elbow” method offers a very clear and naive solution based on intra-cluster variance. The gap statistic, proposed by Tibshirani *et al.*, formalizes this approach and offers an easy-to-implement algorithm that successfully finds the correct $K$ in the case of globular, Gaussian-distributed, mildly disjoint data distributions.

**Update:** For a proper initialization of the centroids at the start of the k-means algorithm, we implement the improved k-means++ seeding procedure.

**Update:** For a comparison of this approach with an alternative method for finding the K in k-means clustering, read this article.


]]>Data visualization plays a crucial role in the communication of results from data analyses, and it should always help transmit insights in an honest and clear way. Recently, the highly recommended blog Flowing Data posted a review of data visualization highlights during 2013, and at The Data Science Lab we felt like doing a bit of pretty plotting as well.

For Python lovers, matplotlib is the library of choice when it comes to plotting. Quite conveniently, the data analysis library pandas comes equipped with useful wrappers around several matplotlib plotting routines, allowing for quick and handy plotting of data frames. Nice examples of plotting with pandas can be seen for instance in this ipython notebook. Still, for customized plots or not so typical visualizations, the pandas wrappers need a bit of tweaking and playing with matplotlib’s internal machinery. If one is willing to devote a bit of time to googling and experimenting, very beautiful plots can emerge.

For this pre-Christmas data visualization table-top experiment we are going to use demographic data from countries in the European Union obtained from Wolfram|Alpha. Our data set contains information on population, extension and life expectancy in 24 European countries. We create a pandas data frame from three series that we simply construct from lists, setting the countries as index for each series, and consequently for the data frame.

    import pandas as pd
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    from matplotlib.colors import LinearSegmentedColormap
    from matplotlib.lines import Line2D

    countries = ['France','Spain','Sweden','Germany','Finland','Poland','Italy',
                 'United Kingdom','Romania','Greece','Bulgaria','Hungary',
                 'Portugal','Austria','Czech Republic','Ireland','Lithuania','Latvia',
                 'Croatia','Slovakia','Estonia','Denmark','Netherlands','Belgium']
    extensions = [547030,504782,450295,357022,338145,312685,301340,243610,238391,
                  131940,110879,93028,92090,83871,78867,70273,65300,64589,56594,
                  49035,45228,43094,41543,30528]
    populations = [63.8,47,9.55,81.8,5.42,38.3,61.1,63.2,21.3,11.4,7.35,
                   9.93,10.7,8.44,10.6,4.63,3.28,2.23,4.38,5.49,1.34,5.61,
                   16.8,10.8]
    life_expectancies = [81.8,82.1,81.8,80.7,80.5,76.4,82.4,80.5,73.8,80.8,73.5,
                         74.6,79.9,81.1,77.7,80.7,72.1,72.2,77,75.4,74.4,79.4,81,80.5]

    data = {'extension' : pd.Series(extensions, index=countries),
            'population' : pd.Series(populations, index=countries),
            'life expectancy' : pd.Series(life_expectancies, index=countries)}
    df = pd.DataFrame(data)
    # DataFrame.sort was removed from pandas; sort_values is its replacement
    df = df.sort_values('life expectancy')

Now, thanks to the pandas plotting machinery, it is extremely straightforward to show the contents of this data frame by calling the `plot` method of each of its columns. The code below generates a figure with three subplots displayed vertically, each of which shows a bar plot for a particular column of the data frame. The plots are automatically labelled with the column names of the data frame, and the whole procedure takes next to no time.

    fig, axes = plt.subplots(nrows=3, ncols=1)
    for i, c in enumerate(df.columns):
        df[c].plot(kind='bar', ax=axes[i], figsize=(12, 10), title=c)
    plt.savefig('EU1.png', bbox_inches='tight')

The output figure looks like this:

While this is an acceptable plot for the first steps of data exploration, the figure is not really publication-ready. It also looks very much “academic” and lacks that subtle flair that infographics in mainstream media have. Over the next paragraphs we will turn this plot into a much more beautiful object by playing around with the options that matplotlib supplies.

Let us first start by creating a figure and an axis object that will contain our subfigure:

    # Create a figure of given size
    fig = plt.figure(figsize=(16, 12))
    # Add a subplot
    ax = fig.add_subplot(111)
    # Set title
    ttl = 'Population, size and age expectancy in the European Union'

Colors are very important for data visualizations. By default, the matplotlib color palette offers solid hues, which can be softened by applying transparencies. Similarly, the default colorbars can be customized to match our taste (see below how one can define a custom-made color map with a gradient that softly changes from orange to gray-blue hues).

    # Set color transparency (0: transparent; 1: solid)
    a = 0.7
    # Create a colormap
    customcmap = [(x/24.0, x/48.0, 0.05) for x in range(len(df))]

The main plotting instruction in our figure uses the pandas plot wrapper. In the initialization options, we specify the type of plot (horizontal bar), the transparency, the color of the bars following the above-defined custom color map, the x-axis limits and the figure title. We also set the color of the bar borders to white for a cleaner look.

    # Plot the 'population' column as horizontal bar plot
    df['population'].plot(kind='barh', ax=ax, alpha=a, legend=False,
                          color=customcmap, edgecolor='w',
                          xlim=(0, max(df['population'])), title=ttl)

After this simple pandas plot directive, the figure already looks very promising. Note that, because we sorted the data frame by life expectancy and applied a gradient color map, the color of the different bars in itself carries information. We will explicitly label that information below when constructing a color bar. For now we want to remove the grid, frame and axes lines from our plot, as well as customize its title and x,y axes labels.

    # Remove grid lines (dotted lines inside plot)
    ax.grid(False)
    # Remove plot frame
    ax.set_frame_on(False)
    # Pandas trick: remove weird dotted line on axis
    # (guarded, since newer pandas versions may not draw it)
    if ax.lines:
        ax.lines[0].set_visible(False)

    # Customize title, set position, allow space on top of plot for title
    ax.set_title(ax.get_title(), fontsize=26, alpha=a, ha='left')
    plt.subplots_adjust(top=0.9)
    ax.title.set_position((0, 1.08))

    # Set x axis label on top of plot, set label text
    ax.xaxis.set_label_position('top')
    xlab = 'Population (in millions)'
    ax.set_xlabel(xlab, fontsize=20, alpha=a, ha='left')
    ax.xaxis.set_label_coords(0, 1.04)

    # Position x tick labels on top
    ax.xaxis.tick_top()
    # Remove tick lines in x and y axes
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')

    # Customize x tick labels
    xticks = [5, 10, 20, 50, 80]
    ax.xaxis.set_ticks(xticks)
    ax.set_xticklabels(xticks, fontsize=16, alpha=a)

    # Customize y tick labels
    yticks = [item.get_text() for item in ax.get_yticklabels()]
    ax.set_yticklabels(yticks, fontsize=16, alpha=a)
    ax.yaxis.set_tick_params(pad=12)

So far, the lengths of our horizontal bars display the population (in millions) of the EU countries. All bars have the same height (which is set to 50% of the total space between bars by default by pandas). An interesting idea is to use the height of the bars to display further data. If we could make the bar height dependent on, say, the countries’ extension, we would be adding a supplementary piece of information to the plot. This is possible in matplotlib by accessing the elements that contain the bars and assigning them a specific height in a `for` loop. Each bar is an element of the class Rectangle, and all the corresponding class methods can be applied to it. For assigning a given height according to each country’s extension, we code a simple linear interpolation and create a `lambda` function to apply it.

    # Set bar height dependent on country extension
    # Set min and max bar thickness (from 0 to 1)
    hmin, hmax = 0.3, 0.9
    xmin, xmax = min(df['extension']), max(df['extension'])
    # Function that interpolates linearly between hmin and hmax
    f = lambda x: hmin + (hmax-hmin)*(x-xmin)/(xmax-xmin)
    # Make array of heights
    hs = [f(x) for x in df['extension']]
    # Iterate over bars
    for container in ax.containers:
        # Each bar has a Rectangle element as child
        for i, child in enumerate(container.get_children()):
            # Reset the lower left point of each bar so that bar is centered
            child.set_y(child.get_y() - 0.125 + 0.5 - hs[i]/2)
            # Attribute height to each Rectangle according to country's size
            plt.setp(child, height=hs[i])
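To make the linear interpolation concrete, here is a minimal sketch that evaluates it at the two extremes of the data set listed earlier (Belgium with 30528 and France with 547030 square kilometers) and at the midpoint between them. The function name `bar_height` is introduced only for this illustration and plays the role of the `lambda` above.

```python
# Thinnest and thickest bars allowed
hmin, hmax = 0.3, 0.9
# Smallest and largest extension in the data set (Belgium and France)
xmin, xmax = 30528, 547030

def bar_height(x):
    """Map an extension x linearly from [xmin, xmax] onto [hmin, hmax]."""
    return hmin + (hmax - hmin) * (x - xmin) / (xmax - xmin)

# The smallest country gets hmin, the largest gets hmax
print(bar_height(xmin), bar_height(xmax), bar_height((xmin + xmax) / 2))
```

A linear scale keeps the legend easy to read; with extensions spanning more than one order of magnitude, a logarithmic mapping would be another option.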

Having added this “dimension” to the plot, we need a way of labelling the information so that the countries’ extension is understandable. A legend would be the ideal solution, but since our plotting directive was set to display the column `['population']`, we cannot use the default. We can construct a “fake” legend though, and customize its handles to roughly match the height of the bars. We position the legend in the lower right part of our plot.

    # Legend
    # Create fake labels for legend
    l1 = Line2D([], [], linewidth=6, color='k', alpha=a)
    l2 = Line2D([], [], linewidth=12, color='k', alpha=a)
    l3 = Line2D([], [], linewidth=22, color='k', alpha=a)

    # Set three legend labels to be min, mean and max of countries extensions
    # (rounded to the nearest 10k km2)
    rnd = 10000
    labels = [str(int(round(l/float(rnd))*rnd))
              for l in (min(df['extension']), df['extension'].mean(),
                        max(df['extension']))]

    # Position legend in lower right part
    # Set ncol=3 for horizontally expanding legend
    leg = ax.legend([l1, l2, l3], labels, ncol=3, frameon=False, fontsize=16,
                    bbox_to_anchor=[1.1, 0.11], handlelength=2,
                    handletextpad=1, columnspacing=2, title='Size (in km2)')

    # Customize legend title
    # Set position to increase space between legend and labels
    plt.setp(leg.get_title(), fontsize=20, alpha=a)
    leg.get_title().set_position((0, 10))
    # Customize transparency for legend labels
    for label in leg.get_texts():
        plt.setp(label, alpha=a)

Finally, there is another piece of information in the plot that needs to be labelled, and that is the color map indicating the average life expectancy in the EU countries. Since we used a custom-made color map, the regular call to `plt.colorbar()` would not work. We need to create a LinearSegmentedColormap instead and “trick” matplotlib to display it as a colorbar. Then we can use the usual customization methods from `colorbar` to set fonts, transparency, position and size of the diverse elements in the color legend.

    # Create a fake colorbar
    ctb = LinearSegmentedColormap.from_list('custombar', customcmap, N=2048)
    # Trick from http://stackoverflow.com/questions/8342549/
    # matplotlib-add-colorbar-to-a-sequence-of-line-plots
    sm = plt.cm.ScalarMappable(cmap=ctb, norm=plt.Normalize(vmin=72, vmax=84))
    # Fake up the array of the scalar mappable
    sm.set_array([])

    # Set colorbar, aspect ratio
    cbar = plt.colorbar(sm, ax=ax, alpha=0.05, aspect=16, shrink=0.4)
    cbar.solids.set_edgecolor("face")
    # Remove colorbar container frame
    cbar.outline.set_visible(False)
    # Fontsize for colorbar ticklabels
    cbar.ax.tick_params(labelsize=16)

    # Customize colorbar tick labels
    mytks = range(72, 86, 2)
    cbar.set_ticks(mytks)
    # Use a loop variable other than `a`, which holds the transparency
    cbar.ax.set_yticklabels([str(t) for t in mytks], alpha=a)

    # Colorbar label, customize fontsize and distance to colorbar
    cbar.set_label('Age expectancy (in years)', alpha=a,
                   rotation=270, fontsize=20, labelpad=20)
    # Remove color bar tick lines, while keeping the tick labels
    cbarytks = plt.getp(cbar.ax.axes, 'yticklines')
    plt.setp(cbarytks, visible=False)

The final and most rewarding step consists of saving the figure in our preferred format.

    # Save figure in png with tight bounding box
    plt.savefig('EU.png', bbox_inches='tight', dpi=300)

The final result looks this beautiful:

When producing a plot based on multidimensional data, it is a good idea to resort to shapes and colors that visually guide us through the variables on display. Matplotlib offers a high level of customization for all details of a plot, although finding exactly which knob to tweak can at times be bewildering. Beautiful plots can be created by experimenting with various settings, among which hues, transparencies and simple layouts are the focal points. The results are publication-ready figures made with open-source software that can be easily replicated by means of structured Python code.

]]>Thus far, the courses offered by Coursera and edX could be understood as an extension of otherwise regular university courses, simply made available to students outside the classroom by means of technology. The materials did somehow exist in a similar form prior to being offered online; transferring them to a MOOC-appropriate format certainly involves extra overhead work from the lecturer, but the structure of the syllabus essentially remains unchanged.

But as an article in Forbes reports, an audacious player in the MOOC providers league, Udacity, is set to disrupt the market with its offers of coaching for students as well as verified certification for their final projects. As Michael Horn bluntly puts it:

But the real disruption in U.S. higher education was never going to come from slapping traditional courses online for free. That is mostly glorified edutainment—not a bad thing for humanity by any means and potentially a useful upgrade over a traditional textbook, but not disruptive to the higher education sector writ large in and of itself. The real disruption in higher education was always going to come from a new system that looks quite different from the current one, begins by serving nonconsumers of traditional higher education, and integrates with employer needs to help students make progress in their lives because of an understanding that employers are ultimately—like it or not—the end customers for higher education because they ultimately finance much of the system for students.

Browsing Udacity’s offering in the Data Science track, one finds interesting videos and catchy trailers. We can choose among Introduction to Data Science, Data Wrangling with MongoDB, Introduction to Hadoop and MapReduce and some other courses, priced at around $100-$150 per month for a duration of 1-2 months.

I do not doubt that there is a market for Udacity’s offers, and I am sure that the quality of their mentoring and materials is worth the investment. However, let us not forget that there are other approaches to MOOCs, as this interview with Prof. Abu-Mostafa illustrates. He makes a very interesting and profound point, with which we at The Data Science Lab thoroughly agree:

Stick to your guns. Don’t water down the course to increase the numbers. Make the course as interesting as possible WITHOUT compromising the rigor and the content. What matters is what the students actually learn and retain. This is real education not a video game or a popularity contest.

What is, in your opinion, the best way to organize MOOCs in hot topics, such as Data Science, that attract tons of attention from media and aspiring practitioners alike?

]]>