A Perceptron is a basic learning algorithm introduced by Frank Rosenblatt in 1958. It is meant to mimic the working logic of a biological neuron: the human brain is essentially a collection of many interconnected neurons, each of which receives a set of inputs, applies some computation to them and propagates the result to other neurons.

That’s exactly what a Perceptron does. Given a sample, the neuron classifies it by assigning a weight to each of its features. To accomplish this a Perceptron undergoes two phases: training and testing. During the training phase the weights are initialized to an arbitrary value; the Perceptron is then asked to evaluate a sample and its decision is compared with the actual class of the sample.

Figure: diagram of a biological neuron (User:Dhp1080, CC BY-SA 3.0, via Wikimedia Commons)

If the algorithm chose the wrong class, the weights are adjusted to better match that particular sample. This process is repeated over and over to progressively refine the weights and the bias. After that, the algorithm is tested against a new set of completely unknown samples to evaluate whether the trained model is general enough to cope with real-world data.

Even this bare-bones description immediately shows many of the common problems we face when working with data sets:

  • Models have to be trained with a large number of already classified samples. It is difficult to know this number a priori: a few dozen may be enough in very simple cases, while in others thousands or more are needed.
  • Data is almost never perfect: a preprocessing phase has to take care of missing features, uncorrelated data and, as we are going to see soon, scaling.
  • Perceptron requires linearly separable samples to achieve convergence.

The math of the Perceptron

If we represent each sample as a vector of size $n$, where $n$ is the number of features, a Perceptron can be modelled as the composition of two functions. The first one, $ f(x) $, maps the input feature vector $ x $ to a scalar value, shifted by a bias $ b $:

$$ f : \mathbb{R}^n \rightarrow \mathbb{R}, \qquad f(x) = w \cdot x + b $$

where $ w \cdot x $ is the dot product of the weight vector $ w $ and $ x $, defined as $ \sum_{i=1}^{n} w_i x_i $.

A threshold function, usually the Heaviside or the sign function, then maps the scalar value to a binary output:

$$ t(x) = \operatorname{sgn}(f(x)) = \begin{cases} 1 & \text{if } f(x) \geq 0 \newline -1 & \text{otherwise} \end{cases} $$

Strictly speaking, if the neuron output is exactly zero we cannot assume the sample belongs to the first class, since it lies on the boundary between the two classes. Nonetheless, for the sake of simplicity we ignore this situation here.
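
To make these two functions concrete, here is a minimal numpy sketch; the weights, bias and sample values below are made up purely for illustration:

import numpy as np

w = np.array([0.4, -0.2, 0.1])  # hypothetical weights
b = -0.5                        # hypothetical bias
x = np.array([1.0, 2.0, 3.0])   # hypothetical sample with n = 3 features

net = np.dot(w, x) + b          # f(x) = 0.4*1.0 - 0.2*2.0 + 0.1*3.0 - 0.5 ≈ -0.2
label = 1 if net >= 0 else -1   # threshold: negative output, so the sample is assigned class -1
print(net, label)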

Let’s dive into the code

The code is written in Python using just numpy, pandas, matplotlib and a few useful functions from sklearn. You can install everything you need by downloading the Anaconda distribution, or simply head to Google Colab.

Our Perceptron model is implemented by a single class which exposes two fundamental methods: fit and predict. Tuning is achieved through two so-called hyper-parameters: a correction factor learning_rate applied to the weights when the model mispredicts a sample, and the number n_iter of repetitions of the training process (passed to the fit method). We can pick some arbitrary values and tweak them later to see if the model response changes accordingly.

import numpy as np


class Perceptron:
    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self._b = 0.0   # y-intercept (bias)
        self._w = None  # weights assigned to the input features
        self.misclassified_samples = []  # errors counted during each training iteration

The binary output of the model is given by the predict method: it computes the net output of the neuron and returns 1 if this output is positive (or zero), -1 otherwise.

def f(self, x: np.array) -> float:
    return np.dot(x, self._w) + self._b

def predict(self, x: np.array) -> int:
    return np.where(self.f(x) >= 0.0, 1, -1)

The core method which actually trains the model on the training data set is fit. It receives two parameters:

  • x: an $ m \times n $ matrix where m is the number of samples in the training set and n is the number of features.
  • y: an array of labels, where the $i^{\text{th}}$ element is the class of the $i^{\text{th}}$ sample in x.

For each sample of each iteration it computes update, the difference between the actual label and the predicted label scaled by learning_rate. If the prediction is correct, update is equal to 0; otherwise the bias and the weights array are adjusted to take the error into account.
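
In formulas, writing $ \hat{y}_i $ for the predicted label of the $i^{\text{th}}$ sample and $ \eta $ for learning_rate, each sample triggers the update

$$ w \leftarrow w + \eta \, (y_i - \hat{y}_i) \, x_i, \qquad b \leftarrow b + \eta \, (y_i - \hat{y}_i) $$

which leaves the parameters untouched when the prediction is correct, since in that case $ y_i - \hat{y}_i = 0 $.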

def fit(self, x: np.array, y: np.array, n_iter=10):
    self._b = 0.0
    self._w = np.zeros(x.shape[1])
    self.misclassified_samples = []

    for _ in range(n_iter):
        errors = 0  # errors during this iteration
        for xi, yi in zip(x, y):
            update = self.learning_rate * (yi - self.predict(xi))
            self._b += update
            self._w += update * xi
            errors += int(update != 0.0)
        self.misclassified_samples.append(errors)

Now let’s assume a sample whose actual class is -1 is wrongly classified as 1 by the neuron. Then

self.learning_rate * (yi - self.predict(xi))

is equal to $0.1 \cdot (-1 - 1) = -0.2$. This value is used to update the _b and _w parameters.
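
We can verify this single step with a couple of lines; the feature values of xi below are invented just for the example:

learning_rate = 0.1
xi = np.array([5.0, 3.0])   # hypothetical feature values of the sample
yi = -1                     # actual class of the sample
prediction = 1              # the (wrong) class predicted by the neuron

update = learning_rate * (yi - prediction)   # 0.1 * (-1 - 1) = -0.2
print(update)               # -0.2, added to _b
print(update * xi)          # roughly [-1.0, -0.6], added to _w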

Here’s the code of the Perceptron class in its entirety:

import numpy as np


class Perceptron:
    """
    Perceptron neuron
    """

    def __init__(self, learning_rate=0.1):
        """
        instantiate a new Perceptron

        :param learning_rate: coefficient used to tune the model
        response to training data
        """
        self.learning_rate = learning_rate
        self._b = 0.0  # y-intercept
        self._w = None  # weights assigned to input features
        # count of errors during each iteration
        self.misclassified_samples = []

    def fit(self, x: np.array, y: np.array, n_iter=10):
        """
        fit the Perceptron model on the training data

        :param x: samples to fit the model on
        :param y: labels of the training samples
        :param n_iter: number of training iterations 
        """
        self._b = 0.0
        self._w = np.zeros(x.shape[1])
        self.misclassified_samples = []

        for _ in range(n_iter):
            # counter of the errors during this training iteration
            errors = 0
            for xi, yi in zip(x, y):
                # for each sample compute the update value
                update = self.learning_rate * (yi - self.predict(xi))
                # and apply it to the y-intercept and weights array
                self._b += update
                self._w += update * xi
                errors += int(update != 0.0)

            self.misclassified_samples.append(errors)

    def f(self, x: np.array) -> float:
        """
        compute the output of the neuron
        :param x: input features
        :return: the output of the neuron
        """
        return np.dot(x, self._w) + self._b

    def predict(self, x: np.array):
        """
        convert the output of the neuron to a binary output
        :param x: input features
        :return: 1 if the output for the sample is positive (or zero),
        -1 otherwise
        """
        return np.where(self.f(x) >= 0, 1, -1)

Iris data set

The Iris data set is one of the best-known data sets used for demonstration purposes. It is available at the UC Irvine Machine Learning Repository in CSV format.

Iris consists of 150 flower samples, each described by 4 attributes (sepal length, sepal width, petal length and petal width). Samples are classified into three categories: Iris Setosa (rows 0-49), Iris Versicolour (rows 50-99) and Iris Virginica (rows 100-149).

Firstly, let’s download the dataset.

import pandas as pd


url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# download and convert the csv into a DataFrame
df = pd.read_csv(url, header=None)
df.head()

# result of df.head()
     0    1    2    3            4
0  5.1  3.5  1.4  0.2  Iris-setosa
1  4.9  3.0  1.4  0.2  Iris-setosa
2  4.7  3.2  1.3  0.2  Iris-setosa
3  4.6  3.1  1.5  0.2  Iris-setosa
4  5.0  3.6  1.4  0.2  Iris-setosa
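
As a quick optional check, a single pandas call confirms that each of the three classes contributes exactly 50 samples:

# count the samples belonging to each class
print(df[4].value_counts())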

To get a rough idea of how these samples are distributed we can simply plot them with matplotlib. Since each sample is described by four features we cannot plot all of them at once (it would require a 4D plot), so we restrict the plot to the first three features.

import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d


# extract the label column
y = df.iloc[:, 4].values
# extract features
x = df.iloc[:, 0:3].values

fig = plt.figure()
ax = plt.axes(projection='3d')

ax.set_title('Iris data set')
ax.set_xlabel("Sepal length in width (cm)")
ax.set_ylabel("Sepal width in width (cm)")
ax.set_zlabel("Petal length in width (cm)")

# plot the samples
ax.scatter(x[:50, 0], x[:50, 1], x[:50, 2], color='red', 
           marker='o', s=4, edgecolor='red', label="Iris Setosa")
ax.scatter(x[50:100, 0], x[50:100, 1], x[50:100, 2], color='blue', 
           marker='^', s=4, edgecolor='blue', label="Iris Versicolour")
ax.scatter(x[100:150, 0], x[100:150, 1], x[100:150, 2], color='green', 
           marker='x', s=4, edgecolor='green', label="Iris Virginica")

plt.legend(loc='upper left')
plt.show()

Looking at the plot it seems that the points could be separated using planes. This becomes even more noticeable if we consider just the samples belonging to the first two categories and the first two features.

x = x[0:100, 0:2]  # keep only the first two classes and the first two features
y = y[0:100]

# plot Iris Setosa samples
plt.scatter(x[:50, 0], x[:50, 1], color='red', marker='o', label='Setosa')
# plot Iris Versicolour samples
plt.scatter(x[50:100, 0], x[50:100, 1], color='blue', marker='x', label='Versicolour')

# show the legend
plt.xlabel("Sepal length")
plt.ylabel("Petal length")
plt.legend(loc='upper left')

# show the plot
plt.show()

Now even a line is enough to separate them. Indeed, since the Perceptron is a binary classifier, from now on we restrict ourselves to this two-class case; we will generalize to more classes later.

We can now train and test the model. The data set is split into a train set and a test set; the model is trained on samples from the first one and tested against the other. The convergence of the model may depend on the choice of the test_size parameter: a train set too big compared to the test set may result in a model able to perform well only on the train set, while a train set too small may not provide enough data to train the model correctly. A value ranging from 0.2 (20% of the samples assigned to the test set) to 0.3 is usually sufficient. The model stores in misclassified_samples the number of errors made during each training iteration, so we can see how this value decreases over time. Note that before training we also have to map the string class labels to the numeric values 1 and -1 expected by the Perceptron, as done in the code below.

from sklearn.model_selection import train_test_split


# map the string labels to a binary integer value
y = np.where(y == 'Iris-setosa', 1, -1)

# split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25,
                                                    random_state=0)

# train the model
classifier = Perceptron(learning_rate=0.01)
classifier.fit(x_train, y_train)

# plot the number of errors during each iteration
plt.plot(range(1, len(classifier.misclassified_samples) + 1),
         classifier.misclassified_samples, marker='o')
plt.xlabel('Epoch')
plt.ylabel('Errors')
plt.show()

The number of errors decreases but never reaches zero. Indeed, if we predict the samples of the test set we achieve an accuracy of 97%: it sounds impressive, but there is still room for improvement.

from sklearn.metrics import accuracy_score
print("accuracy %f" % accuracy_score(classifier.predict(x_test), y_test))

Feature scaling

The model is still making a considerable error because of the different scale of the two features: one feature weighs more than the other and distorts the output. Scaling a feature means reducing it to a normalized distribution with zero mean and unit variance. Given a feature $x_j$ we compute its normalized value by subtracting the mean of that feature $\mu$ and dividing by its standard deviation $\sigma$:

$$ x_{j,\text{std}} = \frac{x_j - \mu}{\sigma} $$

Sklearn provides a class that applies this scaling to our features array. Running the model again, we can see that the error count drops to zero in just a few iterations and the accuracy finally reaches 100%.
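
That class is StandardScaler from sklearn.preprocessing; as a sketch, instead of applying the formula by hand as in the code below, we could equivalently write:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x = scaler.fit_transform(x)   # every column now has zero mean and unit variance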

# plot the first feature before standardization
plt.hist(x[:, 0], bins=100)
plt.title("Features before standardization")
plt.show()

# standardization of the two features
x[:, 0] = (x[:, 0] - x[:, 0].mean()) / x[:, 0].std()
x[:, 1] = (x[:, 1] - x[:, 1].mean()) / x[:, 1].std()

# features after standardization
plt.hist(x[:, 0], bins=100)
plt.title("Features after standardization")
plt.show()

# split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25,
                                                    random_state=0)

# train the model
classifier = Perceptron(learning_rate=0.01)
classifier.fit(x_train, y_train)

# plot the number of errors during each iteration
plt.plot(range(1, len(classifier.misclassified_samples) + 1),
         classifier.misclassified_samples, marker='o')
plt.xlabel('Epoch')
plt.ylabel('Errors')
plt.show()

Figures: distribution of the first feature before and after standardization.

Decision region

To visualize in an intuitive way how the Perceptron algorithm works we can create a grid of values covering the domain of the two features in the test set. We ask the model to predict a label for each cell of the grid, so we can easily see how the model classifies a sample according to its position in the plane.

from matplotlib.colors import ListedColormap


def plot_decision_regions(x, y):
    resolution = 0.001
    
    # define a set of markers
    markers = ('o', 'x')
    # define available colors
    cmap = ListedColormap(('red', 'blue'))
    
    # select a range of x containing the scaled test set
    x1_min, x1_max = x[:, 0].min() - 0.5, x[:, 0].max() + 0.5
    x2_min, x2_max = x[:, 1].min() - 0.5, x[:, 1].max() + 0.5
    
    # create a grid of values to test the classifier on
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    
    # ask the trained (globally defined) classifier to label every grid point
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    
    # plot the decision region...
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    
    # ...and the points from the test set
    for idx, c1 in enumerate(np.unique(y)):
        plt.scatter(x=x[y == c1, 0],
                    y=x[y == c1, 1], 
                    alpha=0.8, 
                    c=cmap(idx), 
                    marker=markers[idx], 
                    label=c1)
    plt.show()


plot_decision_regions(x_test, y_test)

Bonus: How the decision boundary changes at each iteration

We are going to slightly modify our class to demonstrate how the decision boundary changes at each iteration: the weights are now initialized in the constructor and fit no longer resets them, so each call to fit performs further training iterations on top of the previous ones.

class Perceptron:
    def __init__(self, learning_rate=0.1, n_features=1):
        self.learning_rate = learning_rate
        self._b = 0.0
        # weights are created here (rather than in fit) so that repeated calls
        # to fit continue the training from the current state
        self._w = np.zeros(n_features)
        self.misclassified_samples = []

    def fit(self, x: np.array, y: np.array, n_iter=1):
        for _ in range(n_iter):
            errors = 0  # errors during this iteration

            for xi, yi in zip(x, y):
                update = self.learning_rate * (yi - self.predict(xi))
                self._b += update
                self._w += update * xi

                errors += int(update != 0.0)
            self.misclassified_samples.append(errors)
        return self

    ...


classifier = Perceptron(learning_rate=0.001, n_features=2)
for i in range(0, 10):
    classifier.fit(x_train, y_train)
    accuracy = accuracy_score(classifier.predict(x_test), y_test)
    print("Iteration %d: accuracy %f" % (i, accuracy))
    plot_decision_regions(x_test, y_test)

Even though the result may change according to how the data set was split, we should be able to observe that the Perceptron progressively moves the line separating the two classes so as to minimize the error.
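
Since in two dimensions the trained model is just a line, we can also draw the boundary explicitly from the learned parameters. This small sketch peeks at the internal _w and _b attributes of our class and assumes the classifier has already been fitted on the two scaled features:

# the boundary is the set of points where w[0]*x1 + w[1]*x2 + b = 0,
# i.e. x2 = -(w[0]*x1 + b) / w[1] (assuming w[1] != 0)
x1 = np.linspace(x_test[:, 0].min(), x_test[:, 0].max(), 50)
x2 = -(classifier._w[0] * x1 + classifier._b) / classifier._w[1]

plt.scatter(x_test[:, 0], x_test[:, 1], c=y_test)  # test samples coloured by class
plt.plot(x1, x2, color='black')                    # the learned decision boundary
plt.show()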


Final code

"""Iris Perceptron

Automatically generated by Colaboratory.
"""

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

class Perceptron:
    """
    Perceptron neuron
    """

    def __init__(self, learning_rate=0.1):
        """
        instantiate a new Perceptron

        :param learning_rate: coefficient used to tune the model
        response to training data
        """
        self.learning_rate = learning_rate
        self._b = 0.0  # y-intercept
        self._w = None  # weights assigned to input features
        # count of errors during each iteration
        self.misclassified_samples = []

    def fit(self, x: np.array, y: np.array, n_iter=10):
        """
        fit the Perceptron model on the training data

        :param x: samples to fit the model on
        :param y: labels of the training samples
        :param n_iter: number of training iterations 
        """
        self._b = 0.0
        self._w = np.zeros(x.shape[1])
        self.misclassified_samples = []

        for _ in range(n_iter):
            # counter of the errors during this training iteration
            errors = 0
            for xi, yi in zip(x, y):
                # for each sample compute the update value
                update = self.learning_rate * (yi - self.predict(xi))
                # and apply it to the y-intercept and weights array
                self._b += update
                self._w += update * xi
                errors += int(update != 0.0)

            self.misclassified_samples.append(errors)

    def f(self, x: np.array) -> float:
        """
        compute the output of the neuron
        :param x: input features
        :return: the output of the neuron
        """
        return np.dot(x, self._w) + self._b

    def predict(self, x: np.array):
        """
        convert the output of the neuron to a binary output
        :param x: input features
        :return: 1 if the output for the sample is positive (or zero),
        -1 otherwise
        """
        return np.where(self.f(x) >= 0, 1, -1)

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# download and convert the csv into a DataFrame
df = pd.read_csv(url, header=None)
df.head()

# extract the label column
y = df.iloc[:, 4].values
# extract features
x = df.iloc[:, 0:3].values

fig = plt.figure()
ax = plt.axes(projection='3d')

ax.set_title('Iris data set')
ax.set_xlabel("Sepal length in width (cm)")
ax.set_ylabel("Sepal width in width (cm)")
ax.set_zlabel("Petal length in width (cm)")

# plot the samples
ax.scatter(x[:50, 0], x[:50, 1], x[:50, 2], color='red',
           marker='o', s=4, edgecolor='red', label="Iris Setosa")
ax.scatter(x[50:100, 0], x[50:100, 1], x[50:100, 2], color='blue',
           marker='^', s=4, edgecolor='blue', label="Iris Versicolour")
ax.scatter(x[100:150, 0], x[100:150, 1], x[100:150, 2], color='green',
           marker='x', s=4, edgecolor='green', label="Iris Virginica")

plt.legend(loc='upper left')
plt.show()

x = x[0:100, 0:2]  # keep only the first two classes and the first two features
y = y[0:100]

# plot Iris Setosa samples
plt.scatter(x[:50, 0], x[:50, 1], color='red', marker='o', label='Setosa')
# plot Iris Versicolour samples
plt.scatter(x[50:100, 0], x[50:100, 1], color='blue', marker='x',
            label='Versicolour')

# show the legend
plt.xlabel("Sepal length")
plt.ylabel("Petal length")
plt.legend(loc='upper left')

# show the plot
plt.show()

from sklearn.model_selection import train_test_split

# map the labels to a binary integer value
y = np.where(y == 'Iris-setosa', 1, -1)

# standardization of the input features
plt.hist(x[:, 0], bins=100)
plt.title("Features before standardization")
plt.savefig("./before.png", dpi=300)
plt.show()

x[:, 0] = (x[:, 0] - x[:, 0].mean()) / x[:, 0].std()
x[:, 1] = (x[:, 1] - x[:, 1].mean()) / x[:, 1].std()

plt.hist(x[:, 0], bins=100)
plt.title("Features after standardization")
plt.show()

# split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25,
                                                    random_state=0)

# train the model
classifier = Perceptron(learning_rate=0.01)
classifier.fit(x_train, y_train)

# plot the number of errors during each iteration
plt.plot(range(1, len(classifier.misclassified_samples) + 1),
         classifier.misclassified_samples, marker='o')
plt.xlabel('Epoch')
plt.ylabel('Errors')
plt.show()


from matplotlib.colors import ListedColormap

def plot_decision_regions(x, y):
    resolution = 0.001
    
    # define a set of markers
    markers = ('o', 'x')
    # define available colors
    cmap = ListedColormap(('red', 'blue'))
    
    # select a range of x containing the scaled test set
    x1_min, x1_max = x[:, 0].min() - 0.5, x[:, 0].max() + 0.5
    x2_min, x2_max = x[:, 1].min() - 0.5, x[:, 1].max() + 0.5
    
    # create a grid of values to test the classifier on
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    
    # plot the decision region...
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    
    # ...and the points from the test set
    for idx, c1 in enumerate(np.unique(y)):
        plt.scatter(x=x[y == c1, 0],
                    y=x[y == c1, 1], 
                    alpha=0.8, 
                    c=cmap(idx), 
                    marker=markers[idx], 
                    label=c1)
    plt.show()

plot_decision_regions(x_test, y_test)