By now, Deep Learning and Machine Learning have become synonymous with Artificial Intelligence. Neural networks seem to be what flour is to any baker. But it has not always been like this. In fact, there was a period when neural networks were considered useless for AI applications.
This post is part of “The history of AI”, which is a blog post series about scientific milestones and noteworthy people. If you are new here, check out the other posts, starting with “What is AI and where did it come from” and the blog about the big question “Can machines think?”.
With this post, I will introduce you to the perceptron. At its core, the perceptron is the smallest building block of any neural network, and a good place to start if you want to understand neural networks. This blog post is also accompanied by a Google Colab Notebook. The perceptron implementation is quite straightforward and might help you understand the principle better. Play around with the code a bit to understand the individual parts. Of course, you will need to know at least some basic Python to understand the code. But the blog post itself can be understood without any coding knowledge whatsoever.
The perceptron
The perceptron is an example of a binary classifier. It can learn to differentiate between two distinct classes. These classes need to be linearly separable. This means that if you plot your data points on a graph, you need to be able to draw a straight line that separates the two classes. If that is not possible, the perceptron is not the right classifier for this type of problem (disclaimer: it still might be after transforming the data, for example into polar coordinates, but that goes beyond the scope of this article).
To fulfil all cheesy Christmas cliches with this post, let’s assume Santa wants to have a program telling him which child deserves a present and which does not. Of course, this depends on two factors:
(1) naughtiness
(2) the cookies they placed on the plate for Santa.
Conveniently, this data is linearly separable. Our two classes are “gets present” (1) and “no present” (-1), so the binary classifier criterion is met as well.
The goal
Before we start looking at how we get a classifier for our Santa-problem, we should have a look at what we are trying to achieve here. Our data is linearly separable, which means that we are able to draw a straight line through the plot dividing the two classes. And this line is exactly what we want to get from the classifier. We need a function in which we can input a child’s naughtiness and cookie score and see if the child will get a present or not.
The setup
The perceptron requires a number of inputs:
- the input values φ(x),
- a set of initial weights w_0,
- and a learning rate η.
The input values are the coordinates of our data points. For our example, this corresponds to naughtiness x1 and the cookie-count x2. Additionally, it is common to add a bias term x0 = 1 to the input vector, so the vector has one more entry than the number of input values per data point. A child is therefore described by (1, x1, x2).
The initial weights are the first guess of what the function may look like. These come in the same form as the input vector with exactly one value more than the number of input values. You can use any value here. Some starting points may be smarter than others, but you will not always know. We’ll be using (0,0,0) here.
Lastly, we need a learning rate. If you are not familiar with learning rates, don’t worry and stick around. It will be much easier to understand when going through the algorithm.
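The setup can be sketched in a few lines of plain Python. This is only an illustration and not the code from the notebook: the helper name `input_vector` is made up here, and the learning rate of 0.1 is an assumed value, since the post does not state which rate was used.

```python
def input_vector(naughtiness, cookies):
    """Return (x0, x1, x2): the bias term x0 = 1 followed by the two scores."""
    return [1.0, float(naughtiness), float(cookies)]

# First child from the post: naughtiness 2, cookie rating 1.
x = input_vector(2, 1)

# Initial guess for the weights: one value per entry of the input vector.
weights = [0.0] * len(x)

# Assumed value for illustration only; the post does not state its rate.
learning_rate = 0.1

print(x)        # [1.0, 2.0, 1.0]
print(weights)  # [0.0, 0.0, 0.0]
```

Note that the weight list is built from the length of the input vector, so it always has the extra entry for the bias term.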
The algorithm
From a broad perspective, a perceptron tries to find the function which correctly divides the data into its two classes by iteratively adjusting the function. Considering one datapoint per iteration, the algorithm checks if this specific point is correctly classified, by predicting its class with the weights. If so, it continues to the next point, keeping the current weights. If not, it updates them with the Perceptron Update Rule.
Okay, let’s add some math to this.
Prediction
To predict the class, we need two formulae:
a = w · φ(x)
f(a) = 1 if a ≥ 0, otherwise -1
As a first step, we calculate the dot product a between the weights and our input vector. We do this by multiplying each element of the input vector by its respective element of the weight vector (e.g. the first number in the input vector times the first value in the weight vector), and then adding up the resulting values. The second formula turns this sum into one of our two classes.
If we were to take our first child with a naughtiness score of 2 and a Santa-cookie-rating of 1, the calculation would be the following:
a = 0 × 1 + 0 × 2 + 0 × 1 = 0
Unsurprisingly, we end up at 0. If we plug this into the second function, we get f(a) = 1. But this is not right! This child was not meant to get any present (I mean, a cookie rating of 1? Surely Santa does not want to return there). So we need to update our function.
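The two prediction steps can be written out as a small Python sketch. The function names `dot` and `predict` are chosen here for illustration; the threshold at 0 and the outputs 1 and -1 follow the description in this post.

```python
def dot(weights, x):
    """Multiply matching entries of the two vectors and add up the results."""
    return sum(w * xi for w, xi in zip(weights, x))

def predict(weights, x):
    """Return 1 ("gets present") if the dot product is at least 0, else -1."""
    a = dot(weights, x)
    return 1 if a >= 0 else -1

weights = [0.0, 0.0, 0.0]        # initial weights
first_child = [1.0, 2.0, 1.0]    # bias term, naughtiness 2, cookie rating 1

print(dot(weights, first_child))      # 0.0
print(predict(weights, first_child))  # 1
```

With all weights at zero, every child lands exactly on the threshold and is predicted to get a present, which is why the first prediction is wrong.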
Updating
To update, we have to adjust the weights we’ll be using from now on. We do this with the following function:
w_{k+1} = w_k + η × φ(x_n) × t_n
There are a couple of symbols here, so let me quickly explain.
k is the current iteration, making w_{k+1} the new weights we are just calculating and w_k those which just got us a wrong prediction. η is the learning rate, and φ(x_n) is still our input vector. t_n is the class that would have been correct for the current x_n.
Here you can see what the learning rate is for. It controls the impact our update has on the new weights. High learning rates cause big updates, small learning rates cause smaller ones.
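As a rough sketch, the update rule is a one-liner in Python. The function name `update` is made up for this example, and the learning rate of 0.1 is an assumption, so the resulting numbers need not match the notebook.

```python
def update(weights, x, target, learning_rate):
    """Perceptron update: move each weight by learning_rate * input * target."""
    return [w + learning_rate * xi * target for w, xi in zip(weights, x)]

weights = [0.0, 0.0, 0.0]        # weights that produced the wrong prediction
first_child = [1.0, 2.0, 1.0]    # bias term, naughtiness 2, cookie rating 1
target = -1                      # correct class: no present

# learning_rate=0.1 is an assumed value, not taken from the post.
new_weights = update(weights, first_child, target, learning_rate=0.1)
print(new_weights)  # [-0.1, -0.2, -0.1]
```

Doubling the learning rate would double the size of this step, which is the effect described above.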
We are now left with ?k+1, which will be the weights for our next iteration. We take the next data point, plug it into the formula from above, using the new weights:
And … wrong again. This child should have gotten a present. So we update our weights again
and use them for iteration 3:
and finally, we get a correct prediction.
But this does not mean that the next one will be right again. This cycle has to continue until we either (1) find a set of weights that does not change after we have used it for every single data point (that is, weights that correctly predict for each child whether or not they will get a present) or (2) reach a certain stopping criterion. A stopping criterion could, for example, be a maximum number of iterations we want to loop through before stopping.
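Putting prediction, update, and both stopping conditions together gives a loop like the sketch below. It is not the notebook's implementation: the data set is invented for illustration (only the first child, with naughtiness 2 and cookie rating 1, comes from this post), the learning rate of 0.1 is assumed, and the stopping criterion is expressed as a maximum number of passes over the data (`max_epochs`) rather than single iterations.

```python
def predict(weights, x):
    """Return 1 if the dot product of weights and x is at least 0, else -1."""
    a = sum(w * xi for w, xi in zip(weights, x))
    return 1 if a >= 0 else -1

def train_perceptron(samples, targets, learning_rate=0.1, max_epochs=100):
    """Return (weights, converged) after cycling through the data."""
    weights = [0.0] * len(samples[0])
    for _ in range(max_epochs):          # stopping criterion (2)
        changed = False
        for x, t in zip(samples, targets):
            if predict(weights, x) != t:
                # Wrong prediction: apply the perceptron update rule.
                weights = [w + learning_rate * xi * t
                           for w, xi in zip(weights, x)]
                changed = True
        if not changed:                  # stopping criterion (1)
            return weights, True
    return weights, False

# Hypothetical (naughtiness, cookie rating) pairs, chosen to be
# linearly separable; only the first pair appears in the post.
children = [(2, 1), (1, 3), (3, 4), (4, 2), (1, 2), (5, 3)]
targets = [-1, 1, 1, -1, 1, -1]
samples = [[1.0, float(n), float(c)] for n, c in children]

weights, converged = train_perceptron(samples, targets)
print(converged)                                          # True
print([predict(weights, x) for x in samples] == targets)  # True
```

A full pass without any update means every child is classified correctly, so the function can stop early and report that it converged.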
I used the Santa-helping-tool (check out the Google Colab Notebook) to calculate the final weights:
w = (0, -0.1, 0.1)
We see that there was no change anymore after we updated w the last time.
Now, what does this mean for Santa? Well, all he has to do now is take the data he has collected on every child and plug it into the formula from above with our weights.
The output will tell him if the child deserves a present or not.
We can transform the weights into a nice function and draw it into the graph from above to see what actually happened:
As the graph shows, our weights gave us a line that nicely separates the kids who will get a present from the poor suckers who won’t get any.
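One way to turn the weights into a line, sketched here under the assumption that the weights are ordered as (bias, naughtiness, cookies): the boundary is where the dot product equals 0, so solving w0 + w1 × x1 + w2 × x2 = 0 for x2 gives a slope and an intercept. The function name `boundary_line` is made up for this example.

```python
def boundary_line(weights):
    """Solve w0 + w1*x1 + w2*x2 = 0 for x2; return (slope, intercept).

    Only works if the cookie weight w2 is not zero.
    """
    w0, w1, w2 = weights
    slope = -w1 / w2
    intercept = -w0 / w2
    return slope, intercept

# Final weights reported in this post.
slope, intercept = boundary_line([0.0, -0.1, 0.1])
print(slope)  # 1.0
```

For the final weights this is the line cookies = naughtiness. Because the cookie weight is positive, a child on or above that line gets a present, while the first child (naughtiness 2, cookie rating 1) falls below it.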
The perceptron and neural networks
Up to this point, it may not be obvious how the perceptron and neural networks tie together. This becomes more apparent when looking at this graph:

What you see here is exactly what we did. The dot product of the input point and the weights is exactly the same as multiplying each x_n by its w_n and summing all resulting values. In the final step, the sum is converted to an output value of either 1 or -1.
Going back to our Santa-example, in this graph's numbering x1 is our bias term of 1, x2 the child’s naughtiness, and x3 the cookie rating. The output, you will have guessed it, tells us whether or not the child will receive a gift.
If you are familiar with neural networks, you will see that this is what happens in neural networks on a fundamental level.
For all those who are not, let’s have a look at this graphical representation of a simple artificial neural network (ANN). You can see that it is just a more complex perceptron, using the output of one perceptron as the input of another.
The history
The perceptron's mechanism is rather simple; its history is what makes it so interesting. The perceptron was introduced by Frank Rosenblatt in the late 1950s and described at length in his book “Principles of Neurodynamics” (1962). It combines two important works: the McCulloch-Pitts neuron and Hebb’s Rule.
The McCulloch-Pitts neuron is a simple binary neuron that is able to do logical operations. It takes 1s and 0s as inputs and performs Boolean operations on them. However, it does not have any weighting of the input, nor is it able to learn as the perceptron does with the perceptron update rule.
The learning part of the perceptron comes from Hebb’s Rule. It describes a learning rule, which is commonly summarized as “Neurons that fire together, wire together”. The weights in the perceptron can be seen as the neural activity. The neuron fires only if the activity reaches a certain threshold.
Let’s take a quick step back to our Santa example. Here, “firing” means we get an output of 1 and a Santa-helper-elf jumps up to wrap a present. But for this to happen, the dot product of the child’s scores and the weights needs to be greater than or equal to 0. Otherwise, the output of f(a) (remember, a is the dot product and f the second function for the prediction) will be -1. So our dot product needs to reach the threshold of 0 in order for our activation function (f(a)) to fire.
Let’s go back to 1962. Rosenblatt was extremely confident about his invention and claimed that it could solve any classification problem in finite time. Not everyone shared his enthusiasm. In 1969, Minsky and Papert published a book called “Perceptrons: An Introduction to Computational Geometry”. It discusses the main limitation of the perceptron: it can only learn to separate classes that are linearly separable.
In our Santa-example the data was linearly separable (what a coincidence…). But many, in fact most, real-life classification problems are not linearly separable. Minsky and Papert therefore concluded in the book that no significant advances were to be expected from studying the perceptron. This claim would put the research on perceptrons and neural networks into hibernation until the ‘80s.
Fast forward to now: neural networks are a fundamental part of today's AI. This can be explained by an oversight of Minsky. Yes, one perceptron is not able to classify non-linearly separable data. However, if you connect several perceptrons into a neural network, it is able to solve complex classification problems and learn non-linear functions. This, together with backpropagation and hyperparameters (stories for another time), was what was needed to revive neural networks and give them a place in the history of AI.
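To make the point about connected perceptrons concrete, here is a small sketch of XOR, a classic problem that is not linearly separable. The weights below are picked by hand for illustration, not learned, and the encoding of "false" as -1 and "true" as 1 is an assumption made to match the 1 and -1 outputs used in this post.

```python
def perceptron(weights, inputs):
    """weights[0] is the bias weight; output is 1 if the sum is >= 0, else -1."""
    a = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if a >= 0 else -1

def xor_network(x1, x2):
    """Two perceptrons feed a third one; inputs and output are -1 or 1."""
    h_or = perceptron([1, 1, 1], [x1, x2])       # fires unless both are -1
    h_nand = perceptron([1, -1, -1], [x1, x2])   # fires unless both are 1
    return perceptron([-1, 1, 1], [h_or, h_nand])  # fires only if both fired

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, xor_network(x1, x2))
# -1 -1 -1
# -1 1 1
# 1 -1 1
# 1 1 -1
```

No single straight line separates the two "1" outputs from the two "-1" outputs, yet three perceptrons wired together get all four cases right.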
The moral of the story (to stay in the Christmas spirit of this post)? You had better bake good cookies; it can fix a lot.