# Selection of K in K-means Clustering, Reloaded

This article follows up on the series devoted to k-means clustering at The Data Science Lab. Previous posts have dealt with how to implement Lloyd’s algorithm for clustering in python, described an improved initialization algorithm for proper seeding of the initial clusters, k-means++, and introduced the gap statistic as a method of finding the optimal K for k-means clustering.

Although the gap statistic, based on a paper by Tibshirani et al was shown to find optimal values for the number of clusters in a variety of cases when the clusters where globular and mildly disjointed, its performance might be hampered by the need of perfoming Monte Carlo simulations to estimate the reference datasets. A reader of this blog, Jonathan Stray, pointed out a potentially superior method for selecting the K in k-means clustering, so let us implement it and compare.

### An alternative approach to finding the optimal K

The approach suggested by our reader is based on a publication by Pham, Dimov and Nguyen from 2004. The article is very much worth reading, as it includes an explanation of the drawbacks of the standard k-means algorithm as well as a comprehensive survey on different methods that have been proposed for selecting an optimal number of clusters.

In section 3 of the paper, the authors justify the introduction of a function $f(K)$ to evaluate the quality of the resulting clustering and help decide on the optimal value of $K$ for each data set. Quoting from the paper:

A data set with $n$ objects could be grouped into any number of clusters between 1 and $n$, which would correspond to the lowest and the highest levels of detail respectively. By specifying different $K$ values, it is possible to assess the results of grouping objects into various numbers of clusters. From this evaluation, more than one $K$ value could be recommended to users, but the ﬁnal selection is made by them.

The goal of a clustering algorithm is to identify regions in which the data points are concentrated. It is also important to analyze the internal distribution of each cluster as well as its relation to other clusters in the data set. The distorsion of a cluster is a measure of the distance between points in a cluster and its centroid:

$\displaystyle I_j = \sum_{\mathrm{x}_i \in C_j} ||\mathrm{x}_i - \mu_j ||^2$.

The global impact of all clusters’ distortions is given by the quantity

$\displaystyle S_k = \sum_{j=1}^K I_j$.

The authors Pham et al. proceed to discuss further constrains that the sought-after function $f(K)$ should verify for it to be informative to the problem of selection of K. They finally arrive at the following definition:

$N_d$ is the number of dimensions (attributes) of the data set and $\alpha_K$ is a weight factor. With this definition, $f(K)$ is the ratio of the real distortion to the estimated distortion and decreases when there are areas of concentration in the data distribution. Values of $K$ that yield small $f(K)$ can be regarded as giving well-deﬁned clusters.

### A python implementation of Pham et al. f(K)

Our implementation of the Pham et al. procedure builds on the KMeans and KPlusPlus python classes defined in our article on the k-means++ algorithm. We define a new class that inherits from KPlusPlus and contains a function to compute $f(K)$:

class DetK(KPlusPlus):
def fK(self, thisk, Skm1=0):
X = self.X
Nd = len(X[0])
a = lambda k, Nd: 1 - 3/(4*Nd) if k == 2 else a(k-1, Nd) + (1-a(k-1, Nd))/6
self.find_centers(thisk, method='++')
mu, clusters = self.mu, self.clusters
Sk = sum([np.linalg.norm(mu[i]-c)**2 \
for i in range(thisk) for c in clusters[i]])
if thisk == 1:
fs = 1
elif Skm1 == 0:
fs = 1
else:
fs = Sk/(a(thisk,Nd)*Skm1)
return fs, Sk


Note the recursive definition of $\alpha_K$ (variable a in the code snapshot above) and the fact that the computation of $S_K$ for $K > 1$ requires knowing the value of $S_{K-1}$, which is passed as input parameter to the function.

This article aims at showing that the Pham et al. procedure works and is computationally more efficient than the gap statistic. Therefore, we will code up the algorithm for the gap statistic within the same class DetK, so that we can run both procedures simultaneously. The full code is below the fold:

class DetK(KPlusPlus):
def fK(self, thisk, Skm1=0):
X = self.X
Nd = len(X[0])
a = lambda k, Nd: 1 - 3/(4*Nd) if k == 2 else a(k-1, Nd) + (1-a(k-1, Nd))/6
self.find_centers(thisk, method='++')
mu, clusters = self.mu, self.clusters
Sk = sum([np.linalg.norm(mu[i]-c)**2 \
for i in range(thisk) for c in clusters[i]])
if thisk == 1:
fs = 1
elif Skm1 == 0:
fs = 1
else:
fs = Sk/(a(thisk,Nd)*Skm1)
return fs, Sk

def _bounding_box(self):
X = self.X
xmin, xmax = min(X,key=lambda a:a[0])[0], max(X,key=lambda a:a[0])[0]
ymin, ymax = min(X,key=lambda a:a[1])[1], max(X,key=lambda a:a[1])[1]
return (xmin,xmax), (ymin,ymax)

def gap(self, thisk):
X = self.X
(xmin,xmax), (ymin,ymax) = self._bounding_box()
self.init_centers(thisk)
self.find_centers(thisk, method='++')
mu, clusters = self.mu, self.clusters
Wk = np.log(sum([np.linalg.norm(mu[i]-c)**2/(2*len(c)) \
for i in range(thisk) for c in clusters[i]]))
# Create B reference datasets
B = 10
BWkbs = zeros(B)
for i in range(B):
Xb = []
for n in range(len(X)):
Xb.append([random.uniform(xmin,xmax), \
random.uniform(ymin,ymax)])
Xb = np.array(Xb)
kb = DetK(thisk, X=Xb)
kb.init_centers(thisk)
kb.find_centers(thisk, method='++')
ms, cs = kb.mu, kb.clusters
BWkbs[i] = np.log(sum([np.linalg.norm(ms[j]-c)**2/(2*len(c)) \
for j in range(thisk) for c in cs[j]]))
Wkb = sum(BWkbs)/B
sk = np.sqrt(sum((BWkbs-Wkb)**2)/float(B))*np.sqrt(1+1/B)
return Wk, Wkb, sk

def run(self, maxk, which='both'):
ks = range(1,maxk)
fs = zeros(len(ks))
Wks,Wkbs,sks = zeros(len(ks)+1),zeros(len(ks)+1),zeros(len(ks)+1)
# Special case K=1
self.init_centers(1)
if which == 'f':
fs[0], Sk = self.fK(1)
elif which == 'gap':
Wks[0], Wkbs[0], sks[0] = self.gap(1)
else:
fs[0], Sk = self.fK(1)
Wks[0], Wkbs[0], sks[0] = self.gap(1)
# Rest of Ks
for k in ks[1:]:
self.init_centers(k)
if which == 'f':
fs[k-1], Sk = self.fK(k, Skm1=Sk)
elif which == 'gap':
Wks[k-1], Wkbs[k-1], sks[k-1] = self.gap(k)
else:
fs[k-1], Sk = self.fK(k, Skm1=Sk)
Wks[k-1], Wkbs[k-1], sks[k-1] = self.gap(k)
if which == 'f':
self.fs = fs
elif which == 'gap':
G = []
for i in range(len(ks)):
G.append((Wkbs-Wks)[i] - ((Wkbs-Wks)[i+1]-sks[i+1]))
self.G = np.array(G)
else:
self.fs = fs
G = []
for i in range(len(ks)):
G.append((Wkbs-Wks)[i] - ((Wkbs-Wks)[i+1]-sks[i+1]))
self.G = np.array(G)

def plot_all(self):
X = self.X
ks = range(1, len(self.fs)+1)
fig = plt.figure(figsize=(18,5))
# Plot 1
ax1.set_xlim(-1,1)
ax1.set_ylim(-1,1)
ax1.plot(zip(*X)[0], zip(*X)[1], '.', alpha=0.5)
tit1 = 'N=%s' % (str(len(X)))
ax1.set_title(tit1, fontsize=16)
# Plot 2
ax2.set_ylim(0, 1.25)
ax2.plot(ks, self.fs, 'ro-', alpha=0.6)
ax2.set_xlabel('Number of clusters K', fontsize=16)
ax2.set_ylabel('f(K)', fontsize=16)
foundfK = np.where(self.fs == min(self.fs))[0][0] + 1
tit2 = 'f(K) finds %s clusters' % (foundfK)
ax2.set_title(tit2, fontsize=16)
# Plot 3
ax3.bar(ks, self.G, alpha=0.5, color='g', align='center')
ax3.set_xlabel('Number of clusters K', fontsize=16)
ax3.set_ylabel('Gap', fontsize=16)
foundG = np.where(self.G > 0)[0][0] + 1
tit3 = 'Gap statistic finds %s clusters' % (foundG)
ax3.set_title(tit3, fontsize=16)
ax3.xaxis.set_ticks(range(1,len(ks)+1))
plt.savefig('detK_N%s.png' % (str(len(X))), \
bbox_inches='tight', dpi=100)


For a first experiment comparing the Pham et al. and the gap statistic approaches, we create a data set comprising 300 points around 2 Gaussian-distributed clusters. We run both methods to select $K$ spanning the values $K=1, \ldots, 9$. (The function run from class DetK takes a value $k_{thr}$ as input and checks all values such that $k < k_{thr}$.) Note that every run of the k-means clustering algorithm for different values of $K$ is preceded by the k-means++ initialization algorithm, to prevent landing at suboptimal clustering solutions.

To run a full comparison of both methods, the following simple commands are invoked:

kpp = DetK(2, N=300)
kpp.run(10)
kpp.plot_all()


This produces the following result plots:

According to Pham et al. lower values of $f(K)$, and especially values $f(K) < 0.85$ are an indication of cluster-like features in the data at that particular $K$. In the case of $K=2$, the global minimum of $f(K)$ in the central plot leaves no doubt that this is the right value to choose for this particular data configuration. The gap statistic, depicted in the plot on the right, yields the same result of $K=2$. Remember that the optimal $K$ with the gap statistic is the smallest value for which the gap quantity becomes positive.

Similarly, we can analyze a data set consisting of 100 points around a single cluster. The results are shown in the plots below. We observe how the function $f(K)$ does not show any prominent valley or value for which $f(K) < 0.85$ for any of the surveyed $K$s. According to the Pham et al. paper, this is an indication of no clustering, as is the case. The gap statistic agrees that there is no more than one cluster in this case.

Finally, let us look at two cases, both with 500 data points around 4 clusters. Below are the plots of the results:

For the data distribution on the top, one can see that the 4 clusters are positioned in such a way that they could also be interpreted as 2 clusters made of 2 subclusters each. The $f(K)$ detects this configuration and suggests 2 possible values of $K$, with a slight preference for $K=2$ over $K=4$. The gap statistic changes sign at $K=2$, albeit barely, and it does it again and more clearly at $K=4$. In both cases, a strict application of the rules prescribed to select the correct $K$ does lead to a rather suboptimal, or at least dubious, choice.

In the bottom plot however, the 4 clusters are somehow more evenly spreaded and both algorithms succeed at identifying $K=4$. The $f(K)$ method still shows a relative minimum at $K=2$, indicating a potentially alternative clustering.

### Performance comparison of f(K) and the gap statistic

If both methods to select the optimal $K$ for k-means clustering yield similar results, one should ask about the relative performance of them in real-life data science clustering problems. It is straightforward to predict that the gap statistic, with its need for running the k-means algorithm multiple times to create a Monte Carlo reference distribution, will necessarily be a poorer performer. We can easily test this hypothesis with our code by running both approaches and timing them using the IPython magic %time function. For a data set with $N = 500$:

%time kpp.run(10, which='f')


CPU times: user 2.72 s, sys: 0.00 s, total: 2.72 s Wall time: 2.90 s

%time kpp.run(10, which='gap')


CPU times: user 51.30 s, sys: 0.01 s, total: 51.31 s Wall time: 51.40 s

In this particular example, the $f(K)$ method is more than one order of magnitude more performant than the gap statistic, and this comparison looks worse for the latter the more data we take into consideration and the larger the number $B$ employed for generating the reference distributions.

### Table-top data experiment take-away message

The estimation of the optimal number of clusters within a set of data points is a very important problem, as most clustering algorithms need that parameter as input in order to group the data. Many methods have been proposed to find the proper $K$, among which the approach proposed by Pham et al. in 2004 seems to offer a very straightforward and performant solution. The estimation of the function $f(K)$ over the desired range of test values for $K$ offers an immediate way of assessing when the cluster-like features appear and allows to choose among a best value and other alternatives. A comparison in performance with the gap statistic method of Tibshirani et al. concludes that the $f(K)$ is computationally advantageous.

# Improved Seeding For Clustering With K-Means++

Clustering data into subsets is an important task for many data science applications. At The Data Science Lab we have illustrated how Lloyd’s algorithm for k-means clustering works, including snapshots of python code to visualize the iterative clustering steps. One of the issues with the procedure is that

this algorithm does not supply information as to which K for the k-means is optimal; that has to be found out by alternative methods,

so that we went a step further and coded up the gap statistic to find the proper k for k-means clustering. In combination with the clustering algorithm, the gap statistic allows to estimate the best value for k among those in a given range.

An additional problem with the standard k-means procedure still remains though, as shown by the image on the right, where a poor random initialization of the centroids leads to suboptimal clustering:

If the target distribution is disjointedly clustered and only one instantiation of Lloyd’s algorithm is used, the danger exists that the local minimum reached is not the optimal solution.

The initialization problem for the k-means algorithm is an important practical one, and has been discussed extensively. It is desirable to augment the standard k-means clustering procedure with a robust initialization mechanism that guarantees convergence to the optimal solution.

### k-means++: the advantages of careful seeding

A solution called k-means++ was proposed in 2007 by Arthur and Vassilvitskii. This algorithm comes with a theoretical guarantee to find a solution that is O(log k) competitive to the optimal k-means solution. It is also fairly simple to describe and implement. Starting with a dataset X of N points $(\mathrm{x}_1, \ldots, \mathrm{x}_N)$,

• choose an initial center $c_1$ uniformly at random from X. Compute the vector containing the square distances between all points in the dataset and $c_1$: $D_i^2 = ||\mathrm{x}_i - c_1 ||^2$
• choose a second center $c_2$ from X randomly drawn from the probability distribution $D_i^2 / \sum_j D_j^2$
• recompute the distance vector as $D_i^2 = \mathrm{min} \left(||\mathrm{x}_i - c_1 ||^2, ||\mathrm{x}_i - c_2 ||^2\right)$
• choose a successive center $c_l$ and recompute the distance vector as $D_i^2 = \mathrm{min} \left(||\mathrm{x}_i - c_1 ||^2, \ldots, ||\mathrm{x}_i - c_l ||^2\right)$
• when exactly k centers have been chosen, finalize the initialization phase and proceed with the standard k-means algorithm

The interested reader can find a review of the k-means++ algorithm at normaldeviate, a survey of implementations in several languages at rosettacode and a ready-to-use solution in pandas by Jack Maney in github.

### A python implementation of the k-means++ algorithm

Out python implementation of the k-means++ algorithm builds on the code for standard k-means shown in the previous post. The KMeans class defined below contains all necessary functions and methods to generate toy data and run the Lloyd’s clustering algorithm on it:

class KMeans():
def __init__(self, K, X=None, N=0):
self.K = K
if X == None:
if N == 0:
raise Exception("If no data is provided, \
a parameter N (number of points) is needed")
else:
self.N = N
self.X = self._init_board_gauss(N, K)
else:
self.X = X
self.N = len(X)
self.mu = None
self.clusters = None
self.method = None

def _init_board_gauss(self, N, k):
n = float(N)/k
X = []
for i in range(k):
c = (random.uniform(-1,1), random.uniform(-1,1))
s = random.uniform(0.05,0.15)
x = []
while len(x) < n:
a,b = np.array([np.random.normal(c[0],s),np.random.normal(c[1],s)])
# Continue drawing points from the distribution in the range [-1,1]
if abs(a) and abs(b)<1:
x.append([a,b])
X.extend(x)
X = np.array(X)[:N]
return X

def plot_board(self):
X = self.X
fig = plt.figure(figsize=(5,5))
plt.xlim(-1,1)
plt.ylim(-1,1)
if self.mu and self.clusters:
mu = self.mu
clus = self.clusters
K = self.K
for m, clu in clus.items():
cs = cm.spectral(1.*m/self.K)
plt.plot(mu[m][0], mu[m][1], 'o', marker='*', \
markersize=12, color=cs)
plt.plot(zip(*clus[m])[0], zip(*clus[m])[1], '.', \
markersize=8, color=cs, alpha=0.5)
else:
plt.plot(zip(*X)[0], zip(*X)[1], '.', alpha=0.5)
if self.method == '++':
tit = 'K-means++'
else:
tit = 'K-means with random initialization'
pars = 'N=%s, K=%s' % (str(self.N), str(self.K))
plt.title('\n'.join([pars, tit]), fontsize=16)
plt.savefig('kpp_N%s_K%s.png' % (str(self.N), str(self.K)), \
bbox_inches='tight', dpi=200)

def _cluster_points(self):
mu = self.mu
clusters  = {}
for x in self.X:
bestmukey = min([(i[0], np.linalg.norm(x-mu[i[0]])) \
for i in enumerate(mu)], key=lambda t:t[1])[0]
try:
clusters[bestmukey].append(x)
except KeyError:
clusters[bestmukey] = [x]
self.clusters = clusters

def _reevaluate_centers(self):
clusters = self.clusters
newmu = []
keys = sorted(self.clusters.keys())
for k in keys:
newmu.append(np.mean(clusters[k], axis = 0))
self.mu = newmu

def _has_converged(self):
K = len(self.oldmu)
return(set([tuple(a) for a in self.mu]) == \
set([tuple(a) for a in self.oldmu])\
and len(set([tuple(a) for a in self.mu])) == K)

def find_centers(self, method='random'):
self.method = method
X = self.X
K = self.K
self.oldmu = random.sample(X, K)
if method != '++':
# Initialize to K random centers
self.mu = random.sample(X, K)
while not self._has_converged():
self.oldmu = self.mu
# Assign all points in X to clusters
self._cluster_points()
# Reevaluate centers
self._reevaluate_centers()


To initalize the board with n data points normally distributed around k centers, we call kmeans = KMeans(k, N=n).

kmeans = KMeans(3, N=200)
kmeans.find_centers()
kmeans.plot_board()


The snippet above creates a board with 200 points around 3 clusters. The call to the find_centers() function runs the standard k-means algorithm initializing the centroids to 3 random points. Finally, the function plot_board() produces a plot of the data points as clustered by the algorithm, with the centroids marked as stars. In the image below we can see the results of running the algorithm twice. Due to the random initialization of the standard k-means, the correct solution is found some of the times (right panel) whereas in some cases a suboptimal end point is reached instead (left panel).

Let us now implement the k-means++ algorithm in its own class, which inherits from the class Kmeans defined above.

class KPlusPlus(KMeans):
def _dist_from_centers(self):
cent = self.mu
X = self.X
D2 = np.array([min([np.linalg.norm(x-c)**2 for c in cent]) for x in X])
self.D2 = D2

def _choose_next_center(self):
self.probs = self.D2/self.D2.sum()
self.cumprobs = self.probs.cumsum()
r = random.random()
ind = np.where(self.cumprobs >= r)[0][0]
return(self.X[ind])

def init_centers(self):
self.mu = random.sample(self.X, 1)
while len(self.mu) < self.K:
self._dist_from_centers()
self.mu.append(self._choose_next_center())

def plot_init_centers(self):
X = self.X
fig = plt.figure(figsize=(5,5))
plt.xlim(-1,1)
plt.ylim(-1,1)
plt.plot(zip(*X)[0], zip(*X)[1], '.', alpha=0.5)
plt.plot(zip(*self.mu)[0], zip(*self.mu)[1], 'ro')
plt.savefig('kpp_init_N%s_K%s.png' % (str(self.N),str(self.K)), \
bbox_inches='tight', dpi=200)


To run the k-means++ initialization stage using this class and visualize the centers found by the algorithm, we simply do:

kplusplus = KPlusPlus(5, N=200)
kplusplus.init_centers()
kplusplus.plot_init_centers()


Let us explore what the function init_centers() is actually doing: to begin with, a random point is chosen as first center from the X data points as random.sample(self.X, 1). Then, the successive centers are picked, stopping when we have K=5 of them. The procedure to choose the next most suitable center is coded up in the _choose_next_center() function. As we described above, the next center is drawn from a distribution given by the normalized distance vector $D_i^2 / \sum_j D_j^2$. To implement such a probability distribution, we compute the cumulative probabilities for choosing each of the N points in X. These cumulative probabilities are partitions in the interval [0,1] with length equal to the probability of the corresponding point being chosen as a center, as explained in this stackoverflow thread. Therefore, by picking a random value $r \in [0,1]$ and finding the point corresponding to the segment of the partition where that $r$ value falls, we are effectively choosing a point drawn according to the desired probability distribution. On the right is a plot showing the results of the algorithm for 200 points and 5 clusters.

Finally let us compare the results of k-means with random initialization and k-means++ with proper seeding, using the following code snippets:

# Random initialization
kplusplus.find_centers()
kplusplus.plot_board()
# k-means++ initialization
kplusplus.find_centers(method='++')
kplusplus.plot_board()


The standard algorithm with random initialization in a particular instantiation (left panel) fails at identifying the 5 optimal centroids for the clustering, whereas the k-means++ initialization (right panel) succeeds in doing so. By picking up a specific and not random set of centroids to initiate the clustering process, the k-means++ algorithm also reaches convergence faster, guaranteed by the theorems proved in the Arthur and Vassilvitskii article.

### Table-top data experiment take-away message

The k-means++ method for finding a proper seeding for the choice of initial centroids yields considerable improvement over the standard Lloyd’s implementation of the k-means algorithm. The initial selection in k-means++ takes extra time and involves choosing centers in a successive order and drawing them from a particular probability distribution that has to be recomputed at each step. However, by doing so, the k-means part of the algorithm converges very quickly after this seeding and thus the whole procedure actually runs in a shorter computation time. The combination of the k-means++ initialization stage with the standard Lloyd’s algorithm, together with additional various techniques to find out an optimal value for the ideal number of clusters, poses a robust way to solve the complete problem of clustering data points.

# Machine Learning Classics: The Perceptron

An important problem in statistics and machine learning is that of classification, which is the task of identifying the category that an observation belongs to, on the basis of a training set of data containing other instances.

In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering or cluster analysis, and involves grouping data into categories based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional vector space). (Wikipedia)

At The Data Science Lab, we have already reviewed some basics of unsupervised classification with the Lloyd algorithm for k-means clustering and have investigated how to find the appropriate number of clusters. Today’s post will be devoted to a classical machine learning algorithm for supervised classification: the perceptron learning algorithm.

### Theory behind the perceptron

The perceptron learning algorithm was invented in the late 1950s by Frank Rosenblatt. It belongs to the class of linear classifiers, this is, for a data set classified according to binary categories (which we will assume to be labeled +1 and -1), the classifier seeks to divide the two classes by a linear separator. The separator is a (n-1)-dimensional hyperplane in a n-dimensional space, in particular it is a line in the plane and a plane in the 3-dimensional space.

Our data set will be assumed to consist of N observations characterized by d features or attributes, $\mathbf{x}_n = (x_1, \ldots, x_d)$ for $n = (1, \ldots, N)$. The problem of binary classifying these data points can be translated to that of finding a series of weights $w_i$ such that all vectors verifying

$\displaystyle \sum_{i=1}^d w_i x_i < b$

are assigned to one of the classes whereas those verifying

$\displaystyle \sum_{i=1}^d w_i x_i > b$

are assigned to the other, for a given threshold value $b$. If we rename $b = w_0$ and introduce an artificial coordinate $x_0 = 1$ in our vectors $\mathbf{x}_n$, we can write the perceptron separator formula as

$\displaystyle h(\mathbf{x}) = \mathrm{sign}\left(\sum_{i=0}^d w_i x_i\right) = \mathrm{sign}\left( \mathbf{w}^{\mathbf{T}}\mathbf{x}\right)$

Note that $\mathbf{w}^{\mathbf{T}}\mathbf{x}$ is notation for the scalar product between vectors $\mathbf{w}$ and $\mathbf{x}$. Thus the problem of classifying is that of finding the vector of weights $\mathbf{w}$ given a training data set of N vectors $\mathbf{x}$ with their corresponding labeled classification vector $(y_1, \ldots, y_N)$.

### The perceptron learning algorithm (PLA)

The learning algorithm for the perceptron is online, meaning that instead of considering the entire data set at the same time, it only looks at one example at a time, processes it and goes on to the next one. The algorithm starts with a guess for the vector $\mathbf{w}$ (without loss of generalization one can begin with a vector of zeros). It then assesses how good of a guess that is by comparing the predicted labels with the actual, correct labels (remember that those are available for the training test, since we are doing supervised learning). As long as there are misclassified points, the algorithm corrects its guess for the weight vector by updating the weights in the correct direction, until all points are correctly classified.

That direction is as follows: given a labeled training data set, if $\mathbf{w}$ is the guessed weight vector and $\mathbf{x}_n$ is an incorrectly classified point with $\mathbf{w}^{\mathbf{T}}\mathbf{x}_n \neq y_n$, then the weight $\mathbf{w}$ is updated to $\mathbf{w} + y_n \mathbf{x}_n$. This is illustrated in the plot on the right, taken from this clear article on the perceptron.

A nice feature of the perceptron learning rule is that if there exist a set of weights that solve the problem (i.e. if the data is linearly separable), then the perceptron will find these weights.

### A python implementation of the perceptron

For our python implementation we will use a trivial example on two dimensions: within the $[-1,1]\times[-1,1]$ space, we define two random points and draw the line that joins them. The general equation of a line given two points in it, $(x_1, y_1)$ and $(x_2, y_2)$, is $A + Bx + Cy = 0$ where $A, B, C$ can be written in terms of the two points. Defining a vector $\mathrm{V} = (A, B, C)$, any point $(x,y)$ belongs to the line if $\mathrm{V}^\mathrm{T}\mathrm{x} = 0$, where $\mathrm{x} = (1,x,y)$. Points for which the dot product is positive fall on one side of the line, negatives fall on the other.

This procedure automatically divides the plane linearly in two regions. We randomly choose N points in this space and classify them as +1 or -1 according to the dividing line defined before. The Perceptron class defined below is initialized exactly in this way. The perceptron learning algorithm is implemented in the pla function, and the classification error, defined as the fraction of misclassified points, is in classification_error. The code is as follows:

import numpy as np
import random
import os, subprocess

class Perceptron:
def __init__(self, N):
# Random linearly separated data
xA,yA,xB,yB = [random.uniform(-1, 1) for i in range(4)]
self.V = np.array([xB*yA-xA*yB, yB-yA, xA-xB])
self.X = self.generate_points(N)

def generate_points(self, N):
X = []
for i in range(N):
x1,x2 = [random.uniform(-1, 1) for i in range(2)]
x = np.array([1,x1,x2])
s = int(np.sign(self.V.T.dot(x)))
X.append((x, s))
return X

def plot(self, mispts=None, vec=None, save=False):
fig = plt.figure(figsize=(5,5))
plt.xlim(-1,1)
plt.ylim(-1,1)
V = self.V
a, b = -V[1]/V[2], -V[0]/V[2]
l = np.linspace(-1,1)
plt.plot(l, a*l+b, 'k-')
cols = {1: 'r', -1: 'b'}
for x,s in self.X:
plt.plot(x[1], x[2], cols[s]+'o')
if mispts:
for x,s in mispts:
plt.plot(x[1], x[2], cols[s]+'.')
if vec != None:
aa, bb = -vec[1]/vec[2], -vec[0]/vec[2]
plt.plot(l, aa*l+bb, 'g-', lw=2)
if save:
if not mispts:
plt.title('N = %s' % (str(len(self.X))))
else:
plt.title('N = %s with %s test points' \
% (str(len(self.X)),str(len(mispts))))
plt.savefig('p_N%s' % (str(len(self.X))), \
dpi=200, bbox_inches='tight')

def classification_error(self, vec, pts=None):
# Error defined as fraction of misclassified points
if not pts:
pts = self.X
M = len(pts)
n_mispts = 0
for x,s in pts:
if int(np.sign(vec.T.dot(x))) != s:
n_mispts += 1
error = n_mispts / float(M)
return error

def choose_miscl_point(self, vec):
# Choose a random point among the misclassified
pts = self.X
mispts = []
for x,s in pts:
if int(np.sign(vec.T.dot(x))) != s:
mispts.append((x, s))
return mispts[random.randrange(0,len(mispts))]

def pla(self, save=False):
# Initialize the weigths to zeros
w = np.zeros(3)
X, N = self.X, len(self.X)
it = 0
# Iterate until all points are correctly classified
while self.classification_error(w) != 0:
it += 1
# Pick random misclassified point
x, s = self.choose_miscl_point(w)
# Update weights
w += s*x
if save:
self.plot(vec=w)
plt.title('N = %s, Iteration %s\n' \
% (str(N),str(it)))
plt.savefig('p_N%s_it%s' % (str(N),str(it)), \
dpi=200, bbox_inches='tight')
self.w = w

def check_error(self, M, vec):
check_pts = self.generate_points(M)
return self.classification_error(vec, pts=check_pts)


To start a run of the perceptron with 20 data points and visualize the initial configuration we simply initialize the Perceptron class and call the plot function:

p = Perceptron(20)
p.plot()


On the right is the plane that we obtain, divided in two by the black line. Red points are labeled as +1 while blue ones are -1. The purpose of the perceptron learning algorithm is to “learn” a linear classifier that correctly separates red from blue points given the labeled set of 20 points shown in the figure. This is, we want to learn the black line as faithfully as possible.

The call to p.pla() runs the algorithm and stores the final weights learned in p.w. To save a plot of each of the iterations, we can use the option p.pla(save=True). The following snippet will concatenate all images together to produce an animated gif of the running algorithm:

basedir = '/my/output/directory/'
os.chdir(basedir)
pngs = [pl for pl in os.listdir(basedir) if pl.endswith('png')]
sortpngs = sorted(pngs, key=lambda a:int(a.split('it')[1][:-4]))
basepng = pngs[0][:-8]
[sortpngs.append(sortpngs[-1]) for i in range(4)]
comm = ("convert -delay 50 %s %s.gif" % (' '.join(sortpngs), basepng)).split()
proc = subprocess.Popen(comm, stdin = subprocess.PIPE, stdout = subprocess.PIPE)
(out, err) = proc.communicate()


Below we can see how the algorithm tries successive values for the weight vector, leading to a succession of guesses for the linear separator, which are plotted in green. For as long as the green line misclassifies points (meaning that it does not separate the blue from the right points correctly), the perceptron keeps updating the weights in the manner described above. Eventually all points are on the correct side of the guessed line, the classification error in the training data set ($E_{in} = 0$ for the in-sample points) is thus zero and the algorithm converges and stops.

Clearly the final guessed green line after the 7 iterations does separate the training data points correctly but it does not completely agree with the target black line. An error in classifying not-seen data points is bound to exist ($E_{out} \neq 0$ for out-of-sample points), and we can quantify it easily by evaluating the performance of the linear classifier on fresh, unseen data points:

p.plot(p.generate_points(500), p.w, save=True)


In this image we can observe how, even if $E_{in} = 0$ for the training points represented by the large dots, $E_{out} \neq 0$, as shown by the small red and blue dots that fall on one side of the black (target) line but are incorrectly classified by the green (guessed) one. The exact out-of-sample error is given by the area between both lines. This can be thus computed analytically and exactly. But it can also be estimated with a repeated Monte Carlo sampling:

err = []
for i in range(100):
err.append(p.check_error(500, p.w))
np.mean(err)


0.0598200

The perceptron algorithm has thus learned a linear binary classifier that correctly classifies data in 94% of the cases, having an out-of-sample error rate of 6%.

### Table-top data experiment take-away message

The perceptron learning algorithm is a classical example of binary linear supervised classifier. Its implementation involves finding a linear boundary that completely separates points belonging to the two classes. If the data is linearly separable, then the procedure will converge to a weight vector that separates the data. (And if the data is inseparable, the PLA will never converge.) The perceptron is thus fundamentally limited in that its decision boundaries can only be linear. Eventually we will explore other methods that overcome this limitation, either combining multiple perceptrons in a single framework (neural networks) or by mapping features in an efficient way (kernels).