# Machine Learning Classics: The Perceptron

An important problem in statistics and machine learning is that of classification, which is the task of identifying the category that an observation belongs to, on the basis of a training set of data containing other instances.

In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering or cluster analysis, and involves grouping data into categories based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional vector space). (Wikipedia)

At The Data Science Lab, we have already reviewed some basics of unsupervised classification with the Lloyd algorithm for k-means clustering and have investigated how to find the appropriate number of clusters. Today’s post will be devoted to a classical machine learning algorithm for supervised classification: the perceptron learning algorithm.

### Theory behind the perceptron

The perceptron learning algorithm was invented in the late 1950s by Frank Rosenblatt. It belongs to the class of linear classifiers, this is, for a data set classified according to binary categories (which we will assume to be labeled +1 and -1), the classifier seeks to divide the two classes by a linear separator. The separator is a (n-1)-dimensional hyperplane in a n-dimensional space, in particular it is a line in the plane and a plane in the 3-dimensional space.

Our data set will be assumed to consist of N observations characterized by d features or attributes, $\mathbf{x}_n = (x_1, \ldots, x_d)$ for $n = (1, \ldots, N)$. The problem of binary classifying these data points can be translated to that of finding a series of weights $w_i$ such that all vectors verifying

$\displaystyle \sum_{i=1}^d w_i x_i < b$

are assigned to one of the classes whereas those verifying

$\displaystyle \sum_{i=1}^d w_i x_i > b$

are assigned to the other, for a given threshold value $b$. If we rename $b = w_0$ and introduce an artificial coordinate $x_0 = 1$ in our vectors $\mathbf{x}_n$, we can write the perceptron separator formula as

$\displaystyle h(\mathbf{x}) = \mathrm{sign}\left(\sum_{i=0}^d w_i x_i\right) = \mathrm{sign}\left( \mathbf{w}^{\mathbf{T}}\mathbf{x}\right)$

Note that $\mathbf{w}^{\mathbf{T}}\mathbf{x}$ is notation for the scalar product between vectors $\mathbf{w}$ and $\mathbf{x}$. Thus the problem of classifying is that of finding the vector of weights $\mathbf{w}$ given a training data set of N vectors $\mathbf{x}$ with their corresponding labeled classification vector $(y_1, \ldots, y_N)$.

### The perceptron learning algorithm (PLA)

The learning algorithm for the perceptron is online, meaning that instead of considering the entire data set at the same time, it only looks at one example at a time, processes it and goes on to the next one. The algorithm starts with a guess for the vector $\mathbf{w}$ (without loss of generalization one can begin with a vector of zeros). It then assesses how good of a guess that is by comparing the predicted labels with the actual, correct labels (remember that those are available for the training test, since we are doing supervised learning). As long as there are misclassified points, the algorithm corrects its guess for the weight vector by updating the weights in the correct direction, until all points are correctly classified.

That direction is as follows: given a labeled training data set, if $\mathbf{w}$ is the guessed weight vector and $\mathbf{x}_n$ is an incorrectly classified point with $\mathbf{w}^{\mathbf{T}}\mathbf{x}_n \neq y_n$, then the weight $\mathbf{w}$ is updated to $\mathbf{w} + y_n \mathbf{x}_n$. This is illustrated in the plot on the right, taken from this clear article on the perceptron.

A nice feature of the perceptron learning rule is that if there exist a set of weights that solve the problem (i.e. if the data is linearly separable), then the perceptron will find these weights.

### A python implementation of the perceptron

For our python implementation we will use a trivial example on two dimensions: within the $[-1,1]\times[-1,1]$ space, we define two random points and draw the line that joins them. The general equation of a line given two points in it, $(x_1, y_1)$ and $(x_2, y_2)$, is $A + Bx + Cy = 0$ where $A, B, C$ can be written in terms of the two points. Defining a vector $\mathrm{V} = (A, B, C)$, any point $(x,y)$ belongs to the line if $\mathrm{V}^\mathrm{T}\mathrm{x} = 0$, where $\mathrm{x} = (1,x,y)$. Points for which the dot product is positive fall on one side of the line, negatives fall on the other.

This procedure automatically divides the plane linearly in two regions. We randomly choose N points in this space and classify them as +1 or -1 according to the dividing line defined before. The Perceptron class defined below is initialized exactly in this way. The perceptron learning algorithm is implemented in the pla function, and the classification error, defined as the fraction of misclassified points, is in classification_error. The code is as follows:

import numpy as np
import random
import os, subprocess

class Perceptron:
def __init__(self, N):
# Random linearly separated data
xA,yA,xB,yB = [random.uniform(-1, 1) for i in range(4)]
self.V = np.array([xB*yA-xA*yB, yB-yA, xA-xB])
self.X = self.generate_points(N)

def generate_points(self, N):
X = []
for i in range(N):
x1,x2 = [random.uniform(-1, 1) for i in range(2)]
x = np.array([1,x1,x2])
s = int(np.sign(self.V.T.dot(x)))
X.append((x, s))
return X

def plot(self, mispts=None, vec=None, save=False):
fig = plt.figure(figsize=(5,5))
plt.xlim(-1,1)
plt.ylim(-1,1)
V = self.V
a, b = -V[1]/V[2], -V[0]/V[2]
l = np.linspace(-1,1)
plt.plot(l, a*l+b, 'k-')
cols = {1: 'r', -1: 'b'}
for x,s in self.X:
plt.plot(x[1], x[2], cols[s]+'o')
if mispts:
for x,s in mispts:
plt.plot(x[1], x[2], cols[s]+'.')
if vec != None:
aa, bb = -vec[1]/vec[2], -vec[0]/vec[2]
plt.plot(l, aa*l+bb, 'g-', lw=2)
if save:
if not mispts:
plt.title('N = %s' % (str(len(self.X))))
else:
plt.title('N = %s with %s test points' \
% (str(len(self.X)),str(len(mispts))))
plt.savefig('p_N%s' % (str(len(self.X))), \
dpi=200, bbox_inches='tight')

def classification_error(self, vec, pts=None):
# Error defined as fraction of misclassified points
if not pts:
pts = self.X
M = len(pts)
n_mispts = 0
for x,s in pts:
if int(np.sign(vec.T.dot(x))) != s:
n_mispts += 1
error = n_mispts / float(M)
return error

def choose_miscl_point(self, vec):
# Choose a random point among the misclassified
pts = self.X
mispts = []
for x,s in pts:
if int(np.sign(vec.T.dot(x))) != s:
mispts.append((x, s))
return mispts[random.randrange(0,len(mispts))]

def pla(self, save=False):
# Initialize the weigths to zeros
w = np.zeros(3)
X, N = self.X, len(self.X)
it = 0
# Iterate until all points are correctly classified
while self.classification_error(w) != 0:
it += 1
# Pick random misclassified point
x, s = self.choose_miscl_point(w)
# Update weights
w += s*x
if save:
self.plot(vec=w)
plt.title('N = %s, Iteration %s\n' \
% (str(N),str(it)))
plt.savefig('p_N%s_it%s' % (str(N),str(it)), \
dpi=200, bbox_inches='tight')
self.w = w

def check_error(self, M, vec):
check_pts = self.generate_points(M)
return self.classification_error(vec, pts=check_pts)


To start a run of the perceptron with 20 data points and visualize the initial configuration we simply initialize the Perceptron class and call the plot function:

p = Perceptron(20)
p.plot()


On the right is the plane that we obtain, divided in two by the black line. Red points are labeled as +1 while blue ones are -1. The purpose of the perceptron learning algorithm is to “learn” a linear classifier that correctly separates red from blue points given the labeled set of 20 points shown in the figure. This is, we want to learn the black line as faithfully as possible.

The call to p.pla() runs the algorithm and stores the final weights learned in p.w. To save a plot of each of the iterations, we can use the option p.pla(save=True). The following snippet will concatenate all images together to produce an animated gif of the running algorithm:

basedir = '/my/output/directory/'
os.chdir(basedir)
pngs = [pl for pl in os.listdir(basedir) if pl.endswith('png')]
sortpngs = sorted(pngs, key=lambda a:int(a.split('it')[1][:-4]))
basepng = pngs[0][:-8]
[sortpngs.append(sortpngs[-1]) for i in range(4)]
comm = ("convert -delay 50 %s %s.gif" % (' '.join(sortpngs), basepng)).split()
proc = subprocess.Popen(comm, stdin = subprocess.PIPE, stdout = subprocess.PIPE)
(out, err) = proc.communicate()


Below we can see how the algorithm tries successive values for the weight vector, leading to a succession of guesses for the linear separator, which are plotted in green. For as long as the green line misclassifies points (meaning that it does not separate the blue from the right points correctly), the perceptron keeps updating the weights in the manner described above. Eventually all points are on the correct side of the guessed line, the classification error in the training data set ($E_{in} = 0$ for the in-sample points) is thus zero and the algorithm converges and stops.

Clearly the final guessed green line after the 7 iterations does separate the training data points correctly but it does not completely agree with the target black line. An error in classifying not-seen data points is bound to exist ($E_{out} \neq 0$ for out-of-sample points), and we can quantify it easily by evaluating the performance of the linear classifier on fresh, unseen data points:

p.plot(p.generate_points(500), p.w, save=True)


In this image we can observe how, even if $E_{in} = 0$ for the training points represented by the large dots, $E_{out} \neq 0$, as shown by the small red and blue dots that fall on one side of the black (target) line but are incorrectly classified by the green (guessed) one. The exact out-of-sample error is given by the area between both lines. This can be thus computed analytically and exactly. But it can also be estimated with a repeated Monte Carlo sampling:

err = []
for i in range(100):
err.append(p.check_error(500, p.w))
np.mean(err)


0.0598200

The perceptron algorithm has thus learned a linear binary classifier that correctly classifies data in 94% of the cases, having an out-of-sample error rate of 6%.

### Table-top data experiment take-away message

The perceptron learning algorithm is a classical example of binary linear supervised classifier. Its implementation involves finding a linear boundary that completely separates points belonging to the two classes. If the data is linearly separable, then the procedure will converge to a weight vector that separates the data. (And if the data is inseparable, the PLA will never converge.) The perceptron is thus fundamentally limited in that its decision boundaries can only be linear. Eventually we will explore other methods that overcome this limitation, either combining multiple perceptrons in a single framework (neural networks) or by mapping features in an efficient way (kernels).

# Learning Machine Learning Online

The concept of distance, asynchronous learning is not an invention of the digital age. Correspondence and radio courses were already a thing in the past century, offering value to continuous learners and people with otherwise no possibility of attending traditional schools. With the Internet widely spreading over the last decade though, a market for massive open online courses has emerged. Coursera and edX might well be two of the best known providers. Incidentally, Andrew Ng, the co-founder of the former, is a professor of Computer Science and researcher in the field of artificial intelligence. Considering the fact that data science is enjoying lots of popularity at the moment, and that still few higher education institutions are offering comprehensive data science degrees, it is perhaps no surprise that some of the best attended online courses are on machine learning and related data analysis topics.

Below is a brief review of some of my favorite courses.

### Machine Learning by Andrew Ng, Stanford (via Coursera)

Ng’s course has become somewhat of a classic for machine learning beginners and a good introduction to the topic. During 10 weeks, the Stanford professor covers single- and multi-variable linear regression, logistic regression, regularization, neural networks, support vector machines, clustering, and recommender systems, and finishes with general advice for real applications of machine learning. The review quizzes, which can be repeated multiple times, allow the materials to sink in, and the programming exercises, which must be completed in octave/matlab, provide the opportunity to see the methods in action. Most mathematics gory details are skipped, which arguably appeals to students seeking a hands-on approach. The programming exercises are designed as supervised step-by-step milestones, which somehow constrains the room for creativity. On the other hand the provided scripts work flawlessly and help understand and visualize what one is doing.

### Learning From Data by Yaser S. Abu-Mostafa, Caltech (via edX)

Prof. Abu-Mostafa’s introductory machine learning course is a real gem. He’s an excellent educator who, not surprisingly, won the Feynman prize for excellence in teaching in 1996. The course covers similar material to that of Ng’s, however the focus is more on the underlying mathematical notions and is thus more formal. Each week two one-hour lectures are presented, followed by a homework set containing ten questions. At the end of the ten weeks a final exam is given. The participants are free to choose their favorite language to solve the exercises; only solutions, hence no code, are submitted. The questions in the quizzes can be answered only once, making it more challenging than its Coursera equivalent. I particularly enjoyed the rigorous way of explaining the theory of generalization, which really makes you understand when and why is machine learning possible. Here is an interesting interview with Abu-Mostafa.

### Introduction to Data Science by Bill Howe, University of Washington (via Coursera)

This course, taught by Prof. Howe, has an impressive syllabus to begin with. From relational algebra to parallel databases, including Hadoop, MapReduce, (No)SQL, text analysis, and visualization, it seems to cover everything a data scientist needs. In practice, the workload is slightly unbalanced, with some assignments being clearly more challenging than others. The assignments offer the possibility of getting familiar with catchy topics such as sentiment analysis of tweets, Kaggle competitions, MapReduce, Amazon Elastic Cloud and the popular visualization software Tableau. The homework comprises both automatically and peer-graded exercises. The 8-week duration feels a bit rushed, especially towards the end, but overall it was a fun compilation of assignments.

### Natural Language Processing by Michael Collins, Columbia University (via Coursera)

Michael Collins does a great job in this NLP course, which covers very interesting topics in a nice formal and rigorous way. His notes are superb, and the topics chosen provide a solid basis for computational linguistics. There are bi-weekly quizzes, plus three mandatory programming assignments that are challenging enough to keep students occupied for the 10-week course duration. Coding can be done in any language, submitted are just the result files that need to comply with a specific format. I really enjoyed implementing three of the algorithms that Collins very clearly presents in his videos: hidden Markov models for classification, the CKY decoder for parsing, and the translation alignments for the IBM models. NLP skills are certainly a nice addition to any data scientist’s set of tools.

### Data Analysis by Jeff Leek, Johns Hopkins University (via Coursera)

This is a course on applied statistics focused on data analysis. It puts an emphasis on teaching students how to organize and carry on a data analysis, and to write up a report end-to-end. In addition to the weekly review quizzes, there are two peer-reviewed data analysis assignments, which involve the submission of a full report plus figures and references. That in itself is a nice touch, since it forces you to explain clearly what you learned. The presentation of the statistics concepts and the weekly quizzes rely quite heavily on the R language. The course is not loaded with mathematical derivations, but it does provide a good introduction to the usage of R in data analysis, arguably the most spread tool among statisticians.

This is by no means a comprehensive list of online courses for machine learning. New programs are continuously added to the stack of materials of interest for data scientists. Don’t be afraid of enrolling and trying out some of the courses; you might find that the level and focus is exactly right for you, or else you can always let it sit for a semester and come back in future sessions, for many courses are offered yearly.