# Machine Learning Classics: The Perceptron

An important problem in statistics and machine learning is that of classification, which is the task of identifying the category that an observation belongs to, on the basis of a training set of data containing other instances.

In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering or cluster analysis, and involves grouping data into categories based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional vector space). (Wikipedia)

At The Data Science Lab, we have already reviewed some basics of *unsupervised* classification with the Lloyd algorithm for k-means clustering and have investigated how to find the appropriate number of clusters. Today’s post will be devoted to a classical machine learning algorithm for *supervised* classification: the perceptron learning algorithm.

### Theory behind the perceptron

The perceptron learning algorithm was invented in the late 1950s by Frank Rosenblatt. It belongs to the class of linear classifiers, this is, for a data set classified according to binary categories (which we will assume to be labeled +1 and -1), the classifier seeks to divide the two classes by a linear separator. The separator is a *(n-1)*-dimensional hyperplane in a *n*-dimensional space, in particular it is a line in the plane and a plane in the 3-dimensional space.

Our data set will be assumed to consist of *N* observations characterized by *d* features or attributes, for . The problem of binary classifying these data points can be translated to that of finding a series of weights such that all vectors verifying

are assigned to one of the classes whereas those verifying

are assigned to the other, for a given threshold value . If we rename and introduce an artificial coordinate in our vectors , we can write the perceptron separator formula as

Note that is notation for the scalar product between vectors and . Thus the problem of classifying is that of finding the vector of weights given a training data set of *N* vectors with their corresponding labeled classification vector .

### The perceptron learning algorithm (PLA)

The learning algorithm for the perceptron is online, meaning that instead of considering the entire data set at the same time, it only looks at one example at a time, processes it and goes on to the next one. The algorithm starts with a guess for the vector (without loss of generalization one can begin with a vector of zeros). It then assesses how good of a guess that is by comparing the predicted labels with the actual, correct labels (remember that those are available for the training test, since we are doing supervised learning). As long as there are misclassified points, the algorithm corrects its guess for the weight vector by updating the weights in the correct direction, until all points are correctly classified.

That direction is as follows: given a labeled training data set, if is the guessed weight vector and is an incorrectly classified point with , then the weight is updated to . This is illustrated in the plot on the right, taken from this clear article on the perceptron.

A nice feature of the perceptron learning rule is that if there exist a set of weights that solve the problem (i.e. if the data is linearly separable), then the perceptron will find these weights.

### A python implementation of the perceptron

For our python implementation we will use a trivial example on two dimensions: within the space, we define two random points and draw the line that joins them. The general equation of a line given two points in it, and , is where can be written in terms of the two points. Defining a vector , any point belongs to the line if , where . Points for which the dot product is positive fall on one side of the line, negatives fall on the other.

This procedure automatically divides the plane linearly in two regions. We randomly choose *N* points in this space and classify them as +1 or -1 according to the dividing line defined before. The `Perceptron`

class defined below is initialized exactly in this way. The perceptron learning algorithm is implemented in the `pla`

function, and the classification error, defined as the fraction of misclassified points, is in `classification_error`

. The code is as follows:

import numpy as np import random import os, subprocess class Perceptron: def __init__(self, N): # Random linearly separated data xA,yA,xB,yB = [random.uniform(-1, 1) for i in range(4)] self.V = np.array([xB*yA-xA*yB, yB-yA, xA-xB]) self.X = self.generate_points(N) def generate_points(self, N): X = [] for i in range(N): x1,x2 = [random.uniform(-1, 1) for i in range(2)] x = np.array([1,x1,x2]) s = int(np.sign(self.V.T.dot(x))) X.append((x, s)) return X def plot(self, mispts=None, vec=None, save=False): fig = plt.figure(figsize=(5,5)) plt.xlim(-1,1) plt.ylim(-1,1) V = self.V a, b = -V[1]/V[2], -V[0]/V[2] l = np.linspace(-1,1) plt.plot(l, a*l+b, 'k-') cols = {1: 'r', -1: 'b'} for x,s in self.X: plt.plot(x[1], x[2], cols[s]+'o') if mispts: for x,s in mispts: plt.plot(x[1], x[2], cols[s]+'.') if vec != None: aa, bb = -vec[1]/vec[2], -vec[0]/vec[2] plt.plot(l, aa*l+bb, 'g-', lw=2) if save: if not mispts: plt.title('N = %s' % (str(len(self.X)))) else: plt.title('N = %s with %s test points' \ % (str(len(self.X)),str(len(mispts)))) plt.savefig('p_N%s' % (str(len(self.X))), \ dpi=200, bbox_inches='tight') def classification_error(self, vec, pts=None): # Error defined as fraction of misclassified points if not pts: pts = self.X M = len(pts) n_mispts = 0 for x,s in pts: if int(np.sign(vec.T.dot(x))) != s: n_mispts += 1 error = n_mispts / float(M) return error def choose_miscl_point(self, vec): # Choose a random point among the misclassified pts = self.X mispts = [] for x,s in pts: if int(np.sign(vec.T.dot(x))) != s: mispts.append((x, s)) return mispts[random.randrange(0,len(mispts))] def pla(self, save=False): # Initialize the weigths to zeros w = np.zeros(3) X, N = self.X, len(self.X) it = 0 # Iterate until all points are correctly classified while self.classification_error(w) != 0: it += 1 # Pick random misclassified point x, s = self.choose_miscl_point(w) # Update weights w += s*x if save: self.plot(vec=w) plt.title('N = %s, Iteration %s\n' \ % (str(N),str(it))) plt.savefig('p_N%s_it%s' % (str(N),str(it)), \ dpi=200, bbox_inches='tight') self.w = w def check_error(self, M, vec): check_pts = self.generate_points(M) return self.classification_error(vec, pts=check_pts)

To start a run of the perceptron with 20 data points and visualize the initial configuration we simply initialize the `Perceptron`

class and call the `plot`

function:

p = Perceptron(20) p.plot()

On the right is the plane that we obtain, divided in two by the black line. Red points are labeled as +1 while blue ones are -1. The purpose of the perceptron learning algorithm is to “learn” a linear classifier that correctly separates red from blue points given the labeled set of 20 points shown in the figure. This is, we want to learn the black line as faithfully as possible.

The call to `p.pla()`

runs the algorithm and stores the final weights learned in `p.w`

. To save a plot of each of the iterations, we can use the option `p.pla(save=True)`

. The following snippet will concatenate all images together to produce an animated gif of the running algorithm:

basedir = '/my/output/directory/' os.chdir(basedir) pngs = [pl for pl in os.listdir(basedir) if pl.endswith('png')] sortpngs = sorted(pngs, key=lambda a:int(a.split('it')[1][:-4])) basepng = pngs[0][:-8] [sortpngs.append(sortpngs[-1]) for i in range(4)] comm = ("convert -delay 50 %s %s.gif" % (' '.join(sortpngs), basepng)).split() proc = subprocess.Popen(comm, stdin = subprocess.PIPE, stdout = subprocess.PIPE) (out, err) = proc.communicate()

Below we can see how the algorithm tries successive values for the weight vector, leading to a succession of guesses for the linear separator, which are plotted in green. For as long as the green line misclassifies points (meaning that it does not separate the blue from the right points correctly), the perceptron keeps updating the weights in the manner described above. Eventually all points are on the correct side of the guessed line, the classification error in the training data set ( for the in-sample points) is thus zero and the algorithm converges and stops.

Clearly the final guessed green line after the 7 iterations does separate the training data points correctly but it does not completely agree with the target black line. An error in classifying not-seen data points is bound to exist ( for out-of-sample points), and we can quantify it easily by evaluating the performance of the linear classifier on fresh, unseen data points:

p.plot(p.generate_points(500), p.w, save=True)

In this image we can observe how, even if for the training points represented by the large dots, , as shown by the small red and blue dots that fall on one side of the black (target) line but are incorrectly classified by the green (guessed) one. The exact out-of-sample error is given by the area between both lines. This can be thus computed analytically and exactly. But it can also be estimated with a repeated Monte Carlo sampling:

err = [] for i in range(100): err.append(p.check_error(500, p.w)) np.mean(err)

`0.0598200`

The perceptron algorithm has thus learned a linear binary classifier that correctly classifies data in 94% of the cases, having an out-of-sample error rate of 6%.

### Table-top data experiment take-away message

The perceptron learning algorithm is a classical example of binary linear supervised classifier. Its implementation involves finding a linear boundary that completely separates points belonging to the two classes. If the data is linearly separable, then the procedure will converge to a weight vector that separates the data. (And if the data is inseparable, the PLA will never converge.) The perceptron is thus fundamentally limited in that its decision boundaries can only be linear. Eventually we will explore other methods that overcome this limitation, either combining multiple perceptrons in a single framework (neural networks) or by mapping features in an efficient way (kernels).