Machine Learning Classics: The Perceptron

An important problem in statistics and machine learning is that of classification, which is the task of identifying the category that an observation belongs to, on the basis of a training set of data containing other instances.

In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering or cluster analysis, and involves grouping data into categories based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional vector space). (Wikipedia)

At The Data Science Lab, we have already reviewed some basics of unsupervised classification with the Lloyd algorithm for k-means clustering and have investigated how to find the appropriate number of clusters. Today’s post will be devoted to a classical machine learning algorithm for supervised classification: the perceptron learning algorithm.

Theory behind the perceptron

The perceptron learning algorithm was invented in the late 1950s by Frank Rosenblatt. It belongs to the class of linear classifiers, this is, for a data set classified according to binary categories (which we will assume to be labeled +1 and -1), the classifier seeks to divide the two classes by a linear separator. The separator is a (n-1)-dimensional hyperplane in a n-dimensional space, in particular it is a line in the plane and a plane in the 3-dimensional space.

Our data set will be assumed to consist of N observations characterized by d features or attributes, $\mathbf{x}_n = (x_1, \ldots, x_d)$ for $n = (1, \ldots, N)$. The problem of binary classifying these data points can be translated to that of finding a series of weights $w_i$ such that all vectors verifying $\displaystyle \sum_{i=1}^d w_i x_i < b$

are assigned to one of the classes whereas those verifying $\displaystyle \sum_{i=1}^d w_i x_i > b$

are assigned to the other, for a given threshold value $b$. If we rename $b = w_0$ and introduce an artificial coordinate $x_0 = 1$ in our vectors $\mathbf{x}_n$, we can write the perceptron separator formula as $\displaystyle h(\mathbf{x}) = \mathrm{sign}\left(\sum_{i=0}^d w_i x_i\right) = \mathrm{sign}\left( \mathbf{w}^{\mathbf{T}}\mathbf{x}\right)$

Note that $\mathbf{w}^{\mathbf{T}}\mathbf{x}$ is notation for the scalar product between vectors $\mathbf{w}$ and $\mathbf{x}$. Thus the problem of classifying is that of finding the vector of weights $\mathbf{w}$ given a training data set of N vectors $\mathbf{x}$ with their corresponding labeled classification vector $(y_1, \ldots, y_N)$.

The perceptron learning algorithm (PLA)

The learning algorithm for the perceptron is online, meaning that instead of considering the entire data set at the same time, it only looks at one example at a time, processes it and goes on to the next one. The algorithm starts with a guess for the vector $\mathbf{w}$ (without loss of generalization one can begin with a vector of zeros). It then assesses how good of a guess that is by comparing the predicted labels with the actual, correct labels (remember that those are available for the training test, since we are doing supervised learning). As long as there are misclassified points, the algorithm corrects its guess for the weight vector by updating the weights in the correct direction, until all points are correctly classified.

That direction is as follows: given a labeled training data set, if $\mathbf{w}$ is the guessed weight vector and $\mathbf{x}_n$ is an incorrectly classified point with $\mathbf{w}^{\mathbf{T}}\mathbf{x}_n \neq y_n$, then the weight $\mathbf{w}$ is updated to $\mathbf{w} + y_n \mathbf{x}_n$. This is illustrated in the plot on the right, taken from this clear article on the perceptron.

A nice feature of the perceptron learning rule is that if there exist a set of weights that solve the problem (i.e. if the data is linearly separable), then the perceptron will find these weights.

A python implementation of the perceptron

For our python implementation we will use a trivial example on two dimensions: within the $[-1,1]\times[-1,1]$ space, we define two random points and draw the line that joins them. The general equation of a line given two points in it, $(x_1, y_1)$ and $(x_2, y_2)$, is $A + Bx + Cy = 0$ where $A, B, C$ can be written in terms of the two points. Defining a vector $\mathrm{V} = (A, B, C)$, any point $(x,y)$ belongs to the line if $\mathrm{V}^\mathrm{T}\mathrm{x} = 0$, where $\mathrm{x} = (1,x,y)$. Points for which the dot product is positive fall on one side of the line, negatives fall on the other.

This procedure automatically divides the plane linearly in two regions. We randomly choose N points in this space and classify them as +1 or -1 according to the dividing line defined before. The Perceptron class defined below is initialized exactly in this way. The perceptron learning algorithm is implemented in the pla function, and the classification error, defined as the fraction of misclassified points, is in classification_error. The code is as follows:

import numpy as np
import random
import os, subprocess

class Perceptron:
def __init__(self, N):
# Random linearly separated data
xA,yA,xB,yB = [random.uniform(-1, 1) for i in range(4)]
self.V = np.array([xB*yA-xA*yB, yB-yA, xA-xB])
self.X = self.generate_points(N)

def generate_points(self, N):
X = []
for i in range(N):
x1,x2 = [random.uniform(-1, 1) for i in range(2)]
x = np.array([1,x1,x2])
s = int(np.sign(self.V.T.dot(x)))
X.append((x, s))
return X

def plot(self, mispts=None, vec=None, save=False):
fig = plt.figure(figsize=(5,5))
plt.xlim(-1,1)
plt.ylim(-1,1)
V = self.V
a, b = -V/V, -V/V
l = np.linspace(-1,1)
plt.plot(l, a*l+b, 'k-')
cols = {1: 'r', -1: 'b'}
for x,s in self.X:
plt.plot(x, x, cols[s]+'o')
if mispts:
for x,s in mispts:
plt.plot(x, x, cols[s]+'.')
if vec != None:
aa, bb = -vec/vec, -vec/vec
plt.plot(l, aa*l+bb, 'g-', lw=2)
if save:
if not mispts:
plt.title('N = %s' % (str(len(self.X))))
else:
plt.title('N = %s with %s test points' \
% (str(len(self.X)),str(len(mispts))))
plt.savefig('p_N%s' % (str(len(self.X))), \
dpi=200, bbox_inches='tight')

def classification_error(self, vec, pts=None):
# Error defined as fraction of misclassified points
if not pts:
pts = self.X
M = len(pts)
n_mispts = 0
for x,s in pts:
if int(np.sign(vec.T.dot(x))) != s:
n_mispts += 1
error = n_mispts / float(M)
return error

def choose_miscl_point(self, vec):
# Choose a random point among the misclassified
pts = self.X
mispts = []
for x,s in pts:
if int(np.sign(vec.T.dot(x))) != s:
mispts.append((x, s))
return mispts[random.randrange(0,len(mispts))]

def pla(self, save=False):
# Initialize the weigths to zeros
w = np.zeros(3)
X, N = self.X, len(self.X)
it = 0
# Iterate until all points are correctly classified
while self.classification_error(w) != 0:
it += 1
# Pick random misclassified point
x, s = self.choose_miscl_point(w)
# Update weights
w += s*x
if save:
self.plot(vec=w)
plt.title('N = %s, Iteration %s\n' \
% (str(N),str(it)))
plt.savefig('p_N%s_it%s' % (str(N),str(it)), \
dpi=200, bbox_inches='tight')
self.w = w

def check_error(self, M, vec):
check_pts = self.generate_points(M)
return self.classification_error(vec, pts=check_pts)

To start a run of the perceptron with 20 data points and visualize the initial configuration we simply initialize the Perceptron class and call the plot function:

p = Perceptron(20)
p.plot() On the right is the plane that we obtain, divided in two by the black line. Red points are labeled as +1 while blue ones are -1. The purpose of the perceptron learning algorithm is to “learn” a linear classifier that correctly separates red from blue points given the labeled set of 20 points shown in the figure. This is, we want to learn the black line as faithfully as possible.

The call to p.pla() runs the algorithm and stores the final weights learned in p.w. To save a plot of each of the iterations, we can use the option p.pla(save=True). The following snippet will concatenate all images together to produce an animated gif of the running algorithm:

basedir = '/my/output/directory/'
os.chdir(basedir)
pngs = [pl for pl in os.listdir(basedir) if pl.endswith('png')]
sortpngs = sorted(pngs, key=lambda a:int(a.split('it')[:-4]))
basepng = pngs[:-8]
[sortpngs.append(sortpngs[-1]) for i in range(4)]
comm = ("convert -delay 50 %s %s.gif" % (' '.join(sortpngs), basepng)).split()
proc = subprocess.Popen(comm, stdin = subprocess.PIPE, stdout = subprocess.PIPE)
(out, err) = proc.communicate()

Below we can see how the algorithm tries successive values for the weight vector, leading to a succession of guesses for the linear separator, which are plotted in green. For as long as the green line misclassifies points (meaning that it does not separate the blue from the right points correctly), the perceptron keeps updating the weights in the manner described above. Eventually all points are on the correct side of the guessed line, the classification error in the training data set ( $E_{in} = 0$ for the in-sample points) is thus zero and the algorithm converges and stops. Clearly the final guessed green line after the 7 iterations does separate the training data points correctly but it does not completely agree with the target black line. An error in classifying not-seen data points is bound to exist ( $E_{out} \neq 0$ for out-of-sample points), and we can quantify it easily by evaluating the performance of the linear classifier on fresh, unseen data points:

p.plot(p.generate_points(500), p.w, save=True) In this image we can observe how, even if $E_{in} = 0$ for the training points represented by the large dots, $E_{out} \neq 0$, as shown by the small red and blue dots that fall on one side of the black (target) line but are incorrectly classified by the green (guessed) one. The exact out-of-sample error is given by the area between both lines. This can be thus computed analytically and exactly. But it can also be estimated with a repeated Monte Carlo sampling:

err = []
for i in range(100):
err.append(p.check_error(500, p.w))
np.mean(err)

0.0598200

The perceptron algorithm has thus learned a linear binary classifier that correctly classifies data in 94% of the cases, having an out-of-sample error rate of 6%.

Table-top data experiment take-away message

The perceptron learning algorithm is a classical example of binary linear supervised classifier. Its implementation involves finding a linear boundary that completely separates points belonging to the two classes. If the data is linearly separable, then the procedure will converge to a weight vector that separates the data. (And if the data is inseparable, the PLA will never converge.) The perceptron is thus fundamentally limited in that its decision boundaries can only be linear. Eventually we will explore other methods that overcome this limitation, either combining multiple perceptrons in a single framework (neural networks) or by mapping features in an efficient way (kernels).

List Comprehension in Python

I stumbled across this nice tutorial on advanced design patterns in python today, and especially liked the following image that explains graphically what list comprehension is: List comprehension in python is extremely flexible and powerful. Let us practice some more with further neat examples of it:

Square all non-negative numbers smaller than 10

[x**2 for x in range(10)]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Non-negative multiples of 5 smaller than 100

[x for x in range(100) if x%5 == 0]

[0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]

Non-negative multiples of 3 but not multiples of 6 smaller than 50

[x for x in range(50) if x%3 == 0 and x%6 != 0]

[3, 9, 15, 21, 27, 33, 39, 45]

All consonants in a given sentence (without repetition)

import string
punct = string.punctuation + ' '
vowels = "aeiou"
phrase = "On second thought, let's not go to Camelot. It is a silly place."
set([c for c in phrase.lower() if c not in vowels and c not in punct])

{'c', 'd', 'g', 'h', 'l', 'm', 'n', 'p', 's', 't', 'y'}

First character of every word in a sentence

[w for w in phrase.split()]

['O', 's', 't', 'l', 'n', 'g', 't', 'C', 'I', 'i', 'a', 's', 'p']

Substitute all vowels in a sentence by the character ‘0’

"".join([c if c not in vowels else '0' for c in phrase])

"On s0c0nd th00ght, l0t's n0t g0 t0 C0m0l0t. It 0s 0 s0lly pl0c0."

Pairs of elements drawn from different lists

words1 = ['Lancelot', 'Robin', 'Galahad']
words2 = ['Camelot', 'Assyria']
[(w1,w2) for w1 in words1 for w2 in words2]

[('Lancelot', 'Camelot'), ('Lancelot', 'Assyria'), ('Robin', 'Camelot'), ('Robin', 'Assyria'), ('Galahad', 'Camelot'), ('Galahad', 'Assyria')]

I will update this list as more interesting and useful examples come to mind. What’s your favorite use of list comprehension and how many lines of code did it save you?

Yet Another Game Of Life

The famous Game of Life, devised by mathematician John Conway about four decades ago, is the best-known example of a cellular automaton. It has probably been coded up in every possible computer language about a million times by now, as evidenced by this plethora of examples. I was introduced to the rules of the game already 15 years ago during my first year in college, and became immediately fascinated by the complexity and beauty of some of the emerging patterns. At that time I implemented it in C, and soon forgot about it. This is already a very good reason to revisit the game, its rules, and to show animations produced with matplotlib in python. Let’s go!

Rules of Conway’s game of life

As per Wikipedia, the universe of the Game of Life is an infinite two-dimensional grid of square cells, each of which is in one of two possible states, alive or dead. Every cell interacts with its eight neighbors, which are the cells that are horizontally, vertically, or diagonally adjacent. At each step in time, the following transitions occur:

1. Any live cell with fewer than two live neighbors dies, as if caused by under-population.
2. Any live cell with two or three live neighbors lives on to the next generation.
3. Any live cell with more than three live neighbors dies, as if by overcrowding.
4. Any dead cell with exactly three live neighbors becomes a live cell, as if by reproduction.

Yet another python implementation of Life

Our python implementation will use a two-dimensional numpy array to store the grid representing the universe, with values 1 for live and 0 for dead cells. We will code a init function for the initialization of the grid and a evolve routine for the evolution of the universe. A very simple way of initializing the grid to random values, allowing for variable grid dimensions, is as follows:

def init_universe(rows, cols):
grid = np.zeros([rows, cols])
for i in range(rows):
for j in range(cols):
grid[i][j] = round(random.random())
return grid An example of a random universe created with the init_universe(rows, cols) function with 600 cells distributed in 20 rows and 30 columns can be seen in the figure on the right. The call to generate the figure, with black cells representing live (or 1) states, is as follows:

grid = init_universe(20,30)
ax = plt.axes()
ax.matshow(grid,cmap=cm.binary)
ax.set_axis_off()

Now, for the evolution logic, let us code a function that takes a universe as input, together with the parameters that regulate its evolution, and outputs the new universe after one iteration. The classical rules of the game of life set the parameters for overcrowding, under-population and reproduction as 3, 2, 3, respectively. In our implementation, we create a padding around the original universe, which allows us to define the neighbors in an easy way without having to worry whether a particular cell is at the border or not. At every position i,j we compute the sum of all cells in positions $[i-1, i, i+1] \times [j-1, j, j+1]$ and then we subtract the center point at i,j. Then we apply the evolution logic: cells die when underpopulated or overcrowded, and new cells are born when the reproduction condition (3 alive neighbors) is fulfilled:

def evolve(grid, pars):
overcrowd, underpop, reproduction = pars
rows, cols = grid.shape
newgrid = np.zeros([rows, cols])
neighbors = np.zeros([rows,cols])
# Auxiliary padded grid
padboard = np.zeros([rows+2, cols+2])
# Compute neighbours and newgrid
for i in range(rows):
for j in range(cols):
neighbors[i][j] += sum([padboard[a][b] for a in [i-1, i, i+1] \
for b in [j-1, j, j+1]])
# Evolution logic
newgrid[i][j] = grid[i][j]
if grid[i][j] and \
(neighbors[i][j] > overcrowd or neighbors[i][j] < underpop):
newgrid[i][j] = 0
elif not grid[i][j] and neighbors[i][j] == reproduction:
newgrid[i][j] = 1
return newgrid

Note that in the above code we make use of a wonderful property of arrays in python, namely that the last element of an array arr can be referenced either as arr[len(arr)-1] or as arr[-1]. Thus, we create a padboard with 2 columns and 2 rows more than the dimensions of the grid. If n is the number of rows of the grid, the padboard has n+2 rows, which range from 0 to n+1, or, equivalently, from -1 to n!

Visualization of Life and creation of mp4 videos with matplotlib

For the visualization of the evolution of our random universe we could create a series of png plots and stitch them together to produce an animated gif. However, matplotlib also offers the possibility of generating animations and saving them directly in mp4 format. The code that follows is based on this very useful tutorial, which contains instructions to embed matplotlib animations directly in the ipython notebook.

pars = 3, 2, 3
rows, cols = 20, 20
fig = plt.figure()
ax = plt.axes()
im = ax.matshow(init_universe(rows,cols),cmap=cm.binary)
ax.set_axis_off()

def init():
im.set_data(init_universe(rows, cols))

def animate(i):
a = im.get_array()
a = evolve(a, pars)
im.set_array(a)
return [im]

In the code above, we have set the parameters for the evolution as described by the original logic of the game, and we initialize a matplotlib figure using the matshow directive. We also need the functions init and animate; the latter updates the content of the plot with evolved iterations of the universe. The matplotlib call to produce an animation and save it in mp4 format is then simply:

anim = animation.FuncAnimation(fig, animate, init_func=init, frames=100, blit=True)

anim.save('animation_random.mp4', fps=10) # fps = FramesPerSecond

A random initialization gives rise to the following animation, which ends in a configuration with 3 stable patterns (block, blinker and a diamond-shaped structure that settles to a blinker in 8 steps) after approximately 50 iterations.

While a random initialization often gives rise to interesting universes, there is a vast body of research devoted to classifying particular configurations that are known to evolve in a specific fashion (oscillators, stable figures, moving patterns…). A quick google search illustrates this point and leads to many resources for the interested reader. For starters, let us code up the initialization function of a “pulsar”, a type of oscillator with a 3-iteration period.

def init_universe_pulsar():
grid = zeros([15, 15])
line = zeros(15)
line[3:6] = 1
line[9:12] = 1
for ind in [1,6,8,13]:
grid[ind] = line
grid[:,ind] = line
return grid

To generate and save this universe, we need to modify the function init used to produce the matplotlib animation and replace the call to init_universe(rows, cols) with init_universe_pulsar(). The resulting evolution can be seen in the following video:

An interesting kind of universes are those that resemble spacecrafts. There are many of them, as a visit to this page shows. Other configurations resemble guns that emit gliders forever. By far the most famous one is the Cosper glider gun, which can be generated using the following initialization function:

def init_universe_glider_gun():
glider_gun = 38*'0' + 25*'0'+'1'+12*'0' + 23*'0'+'101'+12*'0' +\
13*'0'+'11'+6*'0'+'11'+12*'0'+'11'+'0' +\
12*'0'+'1'+3*'0'+'1'+4*'0'+'11'+12*'0'+'11'+'0' +\
'0'+'11'+8*'0'+'1'+5*'0'+'100011'+15*'0' +\
'0'+'11'+8*'0'+'1'+'000'+'1011'+4*'0'+'101'+12*'0' +\
11*'0'+'1000001'+7*'0'+'1'+12*'0' +\
12*'0'+'10001'+21*'0' + 13*'0'+'11'+23*'0' + 38*'0' +\
19*38*'0'
grid = np.array([float(g) for g in glider_gun]).reshape(30,38)
return grid

Once started, this glider gun evolves emitting gliders indefinitely, which move across the grid at -45 degrees and exit the universe bounding box through the bottom right corner. Bill Gosper discovered this first glider gun, which is so far the smallest one ever found, in 1970 and got 50 dollars from Conway for that. The discovery of the glider gun eventually led to the proof that Conway’s Game of Life could function as a Turing machine. A video of the Gosper glider gun in action can be seen below.

Here is a very nice visualization of yet another type of configuration in Life, the so-called “puffers”, patterns that move like a spaceship but leave debris behind as they evolve.

Table-top data experiment take-away message

Conway’s game of life is a zero-player game, with evolution completely determined by its initial state, consisting on live and dead cells on a two-dimensional grid. The state of each cell varies with each iteration according to the number of populated neighbors in the adjacent cells. The game, devised in 1970, opened up a whole new area of mathematical research, the field of cellular automata, and belongs to a growing class of what are called “simulation games”. Implementing the evolution algorithm behind the game in any programming language is a classical exercise in many CS schools, is always a lot of fun, and allows to explore the huge variety of configurations that give rise to strangely addictive evolution patterns.