Machine Learning Classics: The Perceptron

An important problem in statistics and machine learning is that of classification: the task of identifying the category that a new observation belongs to, on the basis of a training set of data containing observations whose category is already known.

In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering or cluster analysis, and involves grouping data into categories based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional vector space). (Wikipedia)

At The Data Science Lab, we have already reviewed some basics of unsupervised classification with the Lloyd algorithm for k-means clustering and have investigated how to find the appropriate number of clusters. Today’s post will be devoted to a classical machine learning algorithm for supervised classification: the perceptron learning algorithm.

Theory behind the perceptron

The perceptron learning algorithm was invented in the late 1950s by Frank Rosenblatt. It belongs to the class of linear classifiers: for a data set classified according to binary categories (which we will assume to be labeled +1 and -1), the classifier seeks to divide the two classes by a linear separator. The separator is an (n-1)-dimensional hyperplane in an n-dimensional space; in particular, it is a line in the plane and a plane in 3-dimensional space.

Our data set will be assumed to consist of N observations characterized by d features or attributes, \mathbf{x}_n = (x_1, \ldots, x_d) for n = 1, \ldots, N. The problem of binary classification of these data points can be translated into that of finding a set of weights w_i such that all vectors satisfying

\displaystyle \sum_{i=1}^d w_i x_i < b

are assigned to one of the classes, whereas those satisfying

\displaystyle \sum_{i=1}^d w_i x_i > b

are assigned to the other, for a given threshold value b. If we rename -b as w_0 and introduce an artificial coordinate x_0 = 1 in our vectors \mathbf{x}_n, we can write the perceptron separator formula as

\displaystyle h(\mathbf{x}) = \mathrm{sign}\left(\sum_{i=0}^d w_i x_i\right) = \mathrm{sign}\left( \mathbf{w}^{\mathbf{T}}\mathbf{x}\right)

Note that \mathbf{w}^{\mathbf{T}}\mathbf{x} is notation for the scalar product between the vectors \mathbf{w} and \mathbf{x}. Thus the problem of classification is that of finding the vector of weights \mathbf{w}, given a training data set of N vectors \mathbf{x}_n with their corresponding labels (y_1, \ldots, y_N).
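In code, this hypothesis is a one-liner. Here is a minimal sketch (the function name predict is ours; it assumes x already carries the artificial coordinate x_0 = 1):

import numpy as np

def predict(w, x):
    # h(x) = sign(w^T x), with x = (1, x_1, ..., x_d) and w = (w_0, w_1, ..., w_d)
    return int(np.sign(np.dot(w, x)))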

The perceptron learning algorithm (PLA)

The learning algorithm for the perceptron is online, meaning that instead of considering the entire data set at once, it looks at one example at a time, processes it and moves on to the next one. The algorithm starts with a guess for the vector \mathbf{w} (without loss of generality one can begin with a vector of zeros). It then assesses how good a guess that is by comparing the predicted labels with the actual, correct labels (remember that those are available for the training set, since we are doing supervised learning). As long as there are misclassified points, the algorithm corrects its guess for the weight vector by updating the weights in the correct direction, until all points are correctly classified.

That direction is as follows: given a labeled training data set, if \mathbf{w} is the guessed weight vector and \mathbf{x}_n is an incorrectly classified point with \mathrm{sign}(\mathbf{w}^{\mathbf{T}}\mathbf{x}_n) \neq y_n, then the weight vector \mathbf{w} is updated to \mathbf{w} + y_n \mathbf{x}_n. This is illustrated in the plot on the right, taken from this clear article on the perceptron.
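As a toy illustration of a single update (the numerical values below are made up), suppose the current guess \mathbf{w} misclassifies a point with label +1; adding y_n \mathbf{x}_n nudges the separator in the correcting direction:

import numpy as np

w = np.zeros(3)                   # current guess for the weights
x = np.array([1.0, 0.5, -0.3])    # a misclassified point (with x_0 = 1)
y = 1                             # its true label

# sign(w^T x) != y, so the perceptron rule updates w -> w + y*x
w = w + y * x
print(int(np.sign(np.dot(w, x))) == y)   # True: this particular point is now correctly classified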

A nice feature of the perceptron learning rule is that if there exists a set of weights that solves the problem (i.e. if the data is linearly separable), then the perceptron is guaranteed to find such a separating set of weights in a finite number of iterations.

A Python implementation of the perceptron

For our Python implementation we will use a trivial example in two dimensions: within the [-1,1]\times[-1,1] space, we pick two random points and draw the line that joins them. The general equation of a line through two points (x_1, y_1) and (x_2, y_2) is A + Bx + Cy = 0, where A, B, C can be written in terms of the two points. Defining a vector \mathrm{V} = (A, B, C), any point (x,y) belongs to the line if \mathrm{V}^\mathrm{T}\mathrm{x} = 0, where \mathrm{x} = (1,x,y). Points for which the dot product is positive fall on one side of the line, and points for which it is negative fall on the other.
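A quick sanity check of this construction (the two points below are arbitrary values chosen for illustration): with \mathrm{V} built as in the code that follows, \mathrm{V}^\mathrm{T}(1,x,y) vanishes at the two defining points, and its sign splits the rest of the plane into the two regions:

import numpy as np

# Two arbitrary points defining the target line
xA, yA, xB, yB = 0.2, -0.5, -0.7, 0.9
V = np.array([xB*yA - xA*yB, yB - yA, xA - xB])   # (A, B, C)

# Both defining points lie on the line: V^T (1, x, y) = 0
print(np.dot(V, [1, xA, yA]), np.dot(V, [1, xB, yB]))
# Any other point is classified by the sign of the dot product
print(int(np.sign(np.dot(V, [1, 0.0, 0.0]))))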

This procedure automatically divides the plane linearly into two regions. We randomly choose N points in this space and classify them as +1 or -1 according to the dividing line defined above. The Perceptron class defined below is initialized exactly in this way. The perceptron learning algorithm is implemented in the pla function, and the classification error, defined as the fraction of misclassified points, is computed in classification_error. The code is as follows:

import numpy as np
import random
import os, subprocess
import matplotlib.pyplot as plt

class Perceptron:
    def __init__(self, N):
        # Random linearly separated data
        xA,yA,xB,yB = [random.uniform(-1, 1) for i in range(4)]
        self.V = np.array([xB*yA-xA*yB, yB-yA, xA-xB])
        self.X = self.generate_points(N)

    def generate_points(self, N):
        X = []
        for i in range(N):
            x1,x2 = [random.uniform(-1, 1) for i in range(2)]
            x = np.array([1,x1,x2])
            s = int(np.sign(self.V.T.dot(x)))
            X.append((x, s))
        return X

    def plot(self, mispts=None, vec=None, save=False):
        fig = plt.figure(figsize=(5,5))
        V = self.V
        a, b = -V[1]/V[2], -V[0]/V[2]
        l = np.linspace(-1,1)
        plt.plot(l, a*l+b, 'k-')
        cols = {1: 'r', -1: 'b'}
        for x,s in self.X:
            plt.plot(x[1], x[2], cols[s]+'o')
        if mispts:
            for x,s in mispts:
                plt.plot(x[1], x[2], cols[s]+'.')
        if vec is not None:
            aa, bb = -vec[1]/vec[2], -vec[0]/vec[2]
            plt.plot(l, aa*l+bb, 'g-', lw=2)
        if save:
            if not mispts:
                plt.title('N = %s' % (str(len(self.X))))
            else:
                plt.title('N = %s with %s test points' \
                          % (str(len(self.X)),str(len(mispts))))
            plt.savefig('p_N%s' % (str(len(self.X))), \
                        dpi=200, bbox_inches='tight')

    def classification_error(self, vec, pts=None):
        # Error defined as fraction of misclassified points
        if not pts:
            pts = self.X
        M = len(pts)
        n_mispts = 0
        for x,s in pts:
            if int(np.sign(vec.T.dot(x))) != s:
                n_mispts += 1
        error = n_mispts / float(M)
        return error

    def choose_miscl_point(self, vec):
        # Choose a random point among the misclassified
        pts = self.X
        mispts = []
        for x,s in pts:
            if int(np.sign(vec.T.dot(x))) != s:
                mispts.append((x, s))
        return mispts[random.randrange(0,len(mispts))]

    def pla(self, save=False):
        # Initialize the weights to zeros
        w = np.zeros(3)
        X, N = self.X, len(self.X)
        it = 0
        # Iterate until all points are correctly classified
        while self.classification_error(w) != 0:
            it += 1
            # Pick random misclassified point
            x, s = self.choose_miscl_point(w)
            # Update weights
            w += s*x
            if save:
                self.plot(vec=w)
                plt.title('N = %s, Iteration %s\n' \
                          % (str(N),str(it)))
                plt.savefig('p_N%s_it%s' % (str(N),str(it)), \
                            dpi=200, bbox_inches='tight')
        self.w = w

    def check_error(self, M, vec):
        check_pts = self.generate_points(M)
        return self.classification_error(vec, pts=check_pts)

To start a run of the perceptron with 20 data points and visualize the initial configuration we simply initialize the Perceptron class and call the plot function:

p = Perceptron(20)
p.plot()

On the right is the plane that we obtain, divided in two by the black line. Red points are labeled +1, while blue ones are labeled -1. The purpose of the perceptron learning algorithm is to “learn” a linear classifier that correctly separates red from blue points given the labeled set of 20 points shown in the figure. That is, we want to learn the black line as faithfully as possible.

The call to p.pla() runs the algorithm and stores the final weights learned in p.w. To save a plot of each of the iterations, we can use the option p.pla(save=True), as in the short sketch below. The snippet after it will concatenate all the saved images together to produce an animated gif of the running algorithm:
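A minimal usage sketch (the print calls are ours, only to inspect the result):

p.pla(save=True)                       # run the PLA, saving a figure at every iteration
print(p.w)                             # the learned weight vector
print(p.classification_error(p.w))     # 0.0: every training point is now correctly classified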

# Requires ImageMagick's 'convert' command; run this from the directory holding the saved pngs
basedir = '/my/output/directory/'
pngs = [pl for pl in os.listdir(basedir) if pl.endswith('png')]
# Sort the frames by the iteration number encoded in the file name (e.g. p_N20_it3.png)
sortpngs = sorted(pngs, key=lambda a: int(a.split('it')[1][:-4]))
basepng = pngs[0][:-8]
# Repeat the last frame a few times so the final state lingers in the animation
sortpngs.extend([sortpngs[-1]] * 4)
comm = ("convert -delay 50 %s %s.gif" % (' '.join(sortpngs), basepng)).split()
proc = subprocess.Popen(comm, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
(out, err) = proc.communicate()

Below we can see how the algorithm tries successive values for the weight vector, leading to a succession of guesses for the linear separator, which are plotted in green. As long as the green line misclassifies points (meaning that it does not separate the blue from the red points correctly), the perceptron keeps updating the weights in the manner described above. Eventually all points lie on the correct side of the guessed line, the classification error on the training data set is thus zero (E_{in} = 0 for the in-sample points), and the algorithm converges and stops.


Clearly the final guessed green line after 7 iterations does separate the training data points correctly, but it does not completely agree with the target black line. Some error in classifying unseen data points is bound to exist (E_{out} \neq 0 for out-of-sample points), and we can quantify it easily by evaluating the performance of the linear classifier on fresh, unseen data points:

p.plot(p.generate_points(500), p.w, save=True)

In this image we can observe how, even though E_{in} = 0 for the training points represented by the large dots, E_{out} \neq 0, as shown by the small red and blue dots that fall on one side of the black (target) line but are incorrectly classified by the green (guessed) one. The exact out-of-sample error is given by the area between both lines, and can thus be computed analytically and exactly. But it can also be estimated with repeated Monte Carlo sampling:

err = []
for i in range(100):
    err.append(p.check_error(500, p.w))
# Average out-of-sample classification error over the 100 Monte Carlo runs
print(np.mean(err))


The perceptron algorithm has thus learned a linear binary classifier that correctly classifies data in 94% of the cases, having an out-of-sample error rate of 6%.

Table-top data experiment take-away message

The perceptron learning algorithm is a classical example of a binary linear supervised classifier. Its implementation involves finding a linear boundary that completely separates points belonging to the two classes. If the data is linearly separable, then the procedure will converge to a weight vector that separates the data. (And if the data is not linearly separable, the PLA will never converge.) The perceptron is thus fundamentally limited in that its decision boundaries can only be linear. Eventually we will explore other methods that overcome this limitation, either by combining multiple perceptrons in a single framework (neural networks) or by mapping the features in an efficient way (kernels).


  1. jamone

    First, you forgot to add the “import matplotlib” line at the beginning.
    Second, p.plot() seems not to do anything.
    If it does generate a plot, where does the plot display?

  2. Dominic

    How does the vector inner product work with $V^{T}x$? I’m a bit lost, especially with how you’ve defined the vector in your code… Is there an elegant proof I can follow?

  3. sergulaydore

    Thanks for the post. I am having a problem with comm = ("convert -delay 50 %s %s.gif" % (' '.join(sortpngs), basepng)).split()
    Why are you adding .gif to the png files? In my case, comm is like this:


    But it does not work when I feed it to Popen. Can you please guide me with this?

    Error says:

    OSError: [Errno 2] No such file or directory


  4. Pingback: Introducing the Perceptron | Limited Rationality
  5. AOS

    Hi, thanks for the great job. I could run this code on my computer, but I have a question. This code generates the input values randomly. I have some data from different sensors and I want to classify them with this method. What should I do now? How can I use my data as input to this code?
    Thanks in advance.

    • datasciencelab

      Hi AOS,
      to feed the code above any data other than the random values, you need to modify the function generate_points of the Perceptron class. You need to fill the array self.X with the data from your sensors, which may be in a text file that you open and read inside the generate_points function. Assuming your sensor data is 2-dimensional, the array self.X is of the form [(x0,y0), (x1,y1), … , (xN, yN)]. If your sensor data is of higher dimension, you need to fill self.X as [(x00,x10,…,xk0), (x01,x11,…,xk1), … , (x0N, x1N,…,xkN)], where k is the number of dimensions of your data points and N the number of data points you have.

      • Lenny

        Hi, thanks for your code example. I would also like to modify the function generate_points in order to deal with my own sensor data. I filled the vector X with the data just as you described in your post. My question is now: how do I fill the vector V?
        I don’t understand what this vector actually does.
        Thank you in advance!

  6. henry


    quick question: line #9 is self.V = np.array([xB*yA-xA*yB, yB-yA, xA-xB]) and it seems like the exact way you constructed it isn't important: self.V could also be something like np.array([xA*yA/xB*yB, xB-yA, yA-xB]), because it is just the hyperplane we randomly come up with at the beginning, and then the actual x-y pairs (data points) are generated based on that hyperplane (i.e. self.V). Am I correct? Thanks in advance.

    • Nanos Mangart

      Hi, maybe I am missing something, but with the definition on line #9 the geometric meaning is clear. If A=(1,xa,ya) and B=(1,xb,yb) are two points defining a line, then V=AxB is the cross product between A and B. And the “Points for which the dot product is positive fall on one side of the line, negatives fall on the other.” only makes sense when V is defined as above.

  7. Pingback: d191: Machine Learning Classics: The Perceptron – AI
  8. Chowdhury Rahim

    I am trying to determine how fast the PLA converges, so I changed the code as follows:

    while self.classification_error(w) != 0:
        it += 1
        iter = it

        # Pick random misclassified point
        x, s = self.choose_miscl_point(w)
        # Update weights
        w += s*x
        if save:
            plt.title('N = %s, Iteration %s\n' \
                      % (str(N),str(it)))
            plt.savefig('p_N%s_it%s' % (str(N),str(it)), \
                        dpi=200, bbox_inches='tight')

    self.w = w

    i.e. I added a variable iter and tried to print it. Nothing shows up. Can you please comment on how to fix it?
