Unboxing the Black Box

We have all experienced neural networks as black boxes. With artificial intelligence and machine learning appearing in seemingly every new technology, it is hard to keep track of how they actually work. Image recognition is used in a wide variety of applications, from face tracking to disease detection to climate prediction, so let’s dive into one of the most widely used models: convolutional neural networks (CNNs).

* This article is meant for readers who have some experience with neural networks and are starting out with CNNs.

I am not going to dive too deep into the history of CNNs, since there are many resources available on the internet to get acquainted with it. CNNs were first developed for recognising handwriting; they have since become more advanced and are now used in facial recognition, medical imaging and augmented reality. But what does a CNN do exactly? Briefly, convolutional neural networks are used for image data: they take input images and assign importance to the various features of the image. These features can be edges, particular shapes, or whole objects in the image. The image itself is just a series of light intensities (represented by pixels) arranged in a grid. Similar to the human eye, a CNN uses multiple layers to detect the shapes in an image, simpler patterns first and more complex patterns later, and then tries to guess what the input image actually shows. The model used in the attached notebook guesses whether we are giving it a cat or a dog.

The image above shows our model cat; the image is divided into a grid-like structure of varying intensities. 

A CNN is typically made up of three kinds of blocks: the convolution layers, the pooling layers, and the fully connected layers. Let’s go through them one by one. I am not going into the maths of each layer, just the intuition behind them. Interested readers can find the maths in the linked resources at the end of the article.

The convolution layer: downsampling the data

Immediately after saying that we won’t delve into mathematics too much, it rears its head in the next sentence. The convolution layer is a series of multiply-and-add operations. A small matrix of weights, called a kernel (or filter), slides across the grid of pixel intensities. At each position, the kernel’s weights are multiplied element by element with the patch of pixels underneath it, and the products are summed into a single output value (remember the neural network formula: output = input*weight + bias? This is exactly what is happening here, but with image data). The stride is the number of pixels the kernel moves between one position and the next. The figure below shows how the kernel steps across the image to form the patches for multiplication. Note that each square corresponds to a pixel with some intensity value. Did you notice how there are empty squares around the output we created? That happens because the kernel cannot be centred on the border pixels, so the output is smaller than the input. If we want to preserve the shape of the matrix, we fill the border of the input with zeros; this operation is called padding. We have shown a 2D array; for an RGB image, the same operation is done across all three colour channels.
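To make the kernel, stride and padding ideas concrete, here is a minimal pure-Python sketch. The function name `convolve2d`, the 4x4 image and the vertical-edge kernel are all made up for illustration; they are not taken from the notebook. It also leaves out the bias term and handles a single channel only.

```python
def convolve2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image` (both lists of lists) and return the output grid.

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it is a cross-correlation.
    """
    if padding:
        # Surround the image with a border of zeros.
        width = len(image[0]) + 2 * padding
        image = (
            [[0] * width for _ in range(padding)]
            + [[0] * padding + list(row) + [0] * padding for row in image]
            + [[0] * width for _ in range(padding)]
        )
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Multiply the patch by the kernel element by element, then sum.
            total = 0
            for a in range(k_h):
                for b in range(k_w):
                    total += image[top + a][left + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4x4 grid of pixel intensities and a 3x3 vertical-edge kernel.
image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [1, 0, 1, 3],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

unpadded = convolve2d(image, kernel)           # shrinks to 2x2
padded = convolve2d(image, kernel, padding=1)  # stays 4x4
print(unpadded)
print(len(padded), len(padded[0]))
```

Without padding, the 4x4 input shrinks to 2x2; with one ring of zeros, the output keeps the 4x4 shape of the input.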

The pooling layer: Survival of the fittest

Imagine you have a 7-course meal, all laid out in front of you. You want to taste it all without getting full. Instead of finishing one course before moving on to the next, you take a small bite of each. You take sips of water or wine between the bites so that your palate is fresh for the next one. You are basically sampling all the courses. The small bite you took from each dish is representative of the entire dish, since different bites of the same dish taste much the same. You can judge how the food tasted just from those representative bites. That way, you can taste all the dishes in front of you without getting too full and leaving some out. This is exactly what the pooling layer does, but with image pixel intensities. It moves across the image region by region, takes one representative value from each region (either the maximum or the average), and uses those values in the final, smaller matrix.
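The "representative bite" can be sketched in a few lines of pure Python. The function `pool2d`, its 2x2 window and the sample values are assumptions made for illustration only.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Keep one representative value (maximum or average) per window."""
    if mode == "max":
        summarise = max
    else:
        summarise = lambda values: sum(values) / len(values)
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect every value inside the current window.
            window = [
                feature_map[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(summarise(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 output of a convolution layer.
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)  # [[6, 2], [2, 9]]
print(avg_pooled)  # [[3.5, 1.0], [1.0, 6.0]]
```

Each 2x2 region of the input is reduced to a single number, so the 4x4 grid becomes 2x2.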

Fully Connected Layer: It’s all coming together

We have downsampled the data and applied our operations to it so that the neurons in the network contain real information about the image, but they are all still individuals, right? We still need to bring them together to produce outputs. Just like in a regular neural network, we connect everything together to the final layer. This is the job of the fully connected layer. The output from the convolution and pooling layers is flattened into a vector of values, each value indicating how strongly a certain feature is present in the image, and the fully connected layer weighs these features to decide which label fits best. Remember the 7-course menu? Suppose the network has learned that an apple pie comes with features like ‘apple-flavour’, ‘sweet’ and ‘cinnamon’. If you eat another dessert without knowing what it is, and taste these features, you can safely say that what you’re eating is probably an apple pie. Good job, you’ve connected it all together!
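Here is a rough sketch of that last step, assuming a single pooled feature map and two labels. The weights and biases are invented numbers (a trained network would learn them), and the softmax at the end is an assumption: it is a common way to turn scores into probabilities, but the article does not say what the notebook uses.

```python
import math


def flatten(feature_maps):
    """Unroll a list of 2D feature maps into one flat vector."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def fully_connected(inputs, weights, biases):
    """One score per label: output = input*weight + bias, summed over all inputs."""
    return [
        sum(x * w for x, w in zip(inputs, label_weights)) + bias
        for label_weights, bias in zip(weights, biases)
    ]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (assumed final step)."""
    peak = max(scores)  # subtracting the peak keeps exp() numerically stable
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["cat", "dog"]
feature_maps = [[[6, 2], [2, 9]]]  # hypothetical pooled output
features = flatten(feature_maps)   # [6, 2, 2, 9]

# Made-up weights: one row of weights per label.
weights = [
    [0.1, -0.2, 0.0, 0.3],
    [-0.1, 0.2, 0.1, -0.3],
]
biases = [0.0, 0.5]

scores = fully_connected(features, weights, biases)
probabilities = softmax(scores)
prediction = labels[probabilities.index(max(probabilities))]
print(prediction)
```

With these invented weights the "cat" score comes out higher, so the sketch predicts a cat.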

But what about the activation functions?

The activation functions turn the linear operation of convolution into a nonlinear one. This is needed because the relationship between the pixels and the features of the image is not linear. An activation function is applied after each convolution operation. The mathematics behind them can be found in the resources listed below.
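As one example, ReLU is a commonly used activation function: it keeps positive values and replaces negative ones with zero. The article does not name the activation used in the notebook, so treat ReLU here as an assumption; the input values are made up.

```python
def relu(feature_map):
    """Apply ReLU to every value: negatives become 0, positives pass through."""
    return [[max(0, value) for value in row] for row in feature_map]


# Hypothetical output of a convolution layer, before activation.
conv_output = [[-6, 12], [-4, 7]]
activated = relu(conv_output)
print(activated)  # [[0, 12], [0, 7]]
```

The shape of the grid is unchanged; only the values are transformed.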

When dealing with image data, we repeat the convolution and pooling layers multiple times. This helps decompose the input values step by step. We use a variety of filters: some filters operate on shapes and lines, while others work directly on the pixel intensities. Using multiple convolution layers lets us apply a wider variety of filters to better classify the features of the images. But remember, deeper is not always better!

The basic architecture of a convolutional neural network can look something like this:

Input Image -> Convolution1 -> Activation -> Pooling1 -> Convolution2 -> Activation -> Pooling2 -> Fully Connected Layer -> Output
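To get a feel for how the image shrinks along this pipeline, the sketch below traces the side length of a square image through each layer. The 64-pixel input, the 3x3 kernels and the 2x2 pooling windows are assumed values, not the notebook's settings. Activation layers are left out because they do not change the shape.

```python
def output_size(size, window, stride=1, padding=0):
    """Side length after sliding a square window over a square input."""
    return (size + 2 * padding - window) // stride + 1


# (layer name, window size, stride): all assumed values for illustration.
layers = [
    ("Convolution1", 3, 1),
    ("Pooling1", 2, 2),
    ("Convolution2", 3, 1),
    ("Pooling2", 2, 2),
]

size = 64  # assumed 64x64 input image
trace = [("Input Image", size)]
for name, window, stride in layers:
    size = output_size(size, window, stride)
    trace.append((name, size))

for name, side in trace:
    print(f"{name}: {side}x{side}")

# Values per feature map handed to the fully connected layer.
flattened_length = size * size
print(flattened_length)
```

Under these assumptions the image goes from 64x64 to 14x14 before it is flattened for the fully connected layer.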

Convolutional Neural Networks Explanation Visualised

So far, the post has covered the architecture of convolutional neural networks and how they work. The next part shows how each layer works with the help of feature maps. The code used to visualise this can be found in the references.

I have created a CNN with three convolution-pooling blocks followed by a final fully connected layer, which classifies cats and dogs. Through the series of images that follows, you will see how the CNN understands the features of an image. But first, a picture of a cute cat.

This is the cat model I am going to use for the CNN* 

Remember how the first few layers of the CNN detect the edges and very basic shapes? This is what the cat looks like to the very first channel of the CNN.

Notice the thin lines? Yes, those are the edges in the picture recognised by the architecture. The picture doesn’t tell us very much, because it’s just the first filter. The layers closer to the input image show us a lot of detail. As we go deeper, it becomes difficult for a human to perceive the shapes recognised by the architecture. In other words, the first few layers capture a lot of details and small patterns, while the deeper layers identify the general patterns in the image. The following image, from the 3rd layer of the CNN, is almost indiscernible to human eyes. This cat almost looks like the horror-show from the thiscatdoesnotexist page. But now, the CNN is learning the general patterns of the image, not focusing on the details.


The visualisation of the fifth layer is even further away from our original cat. By now, we can barely see the ears and the paws of the cat. But the general shape of the cat is what matters to the CNN.

With multiple layers and a deep architecture, the model learns these generalised features and can recognise them very well when identifying image labels. In the notebook, there are some black filters as well; that just means that those filters were not activated. As we go deeper, the activations are much less discernible visually, but they still retain the major features.
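One simple way to spot such black filters programmatically is to look for feature maps whose values are all zero. This is only a sketch: the helper `inactive_filters` and the sample feature maps are hypothetical, and the notebook may do this differently.

```python
def inactive_filters(feature_maps, tolerance=0.0):
    """Return the indices of feature maps whose values are all (near) zero."""
    return [
        index
        for index, fmap in enumerate(feature_maps)
        if all(abs(value) <= tolerance for row in fmap for value in row)
    ]


# Hypothetical activations of three filters for one image.
feature_maps = [
    [[0, 12], [0, 7]],   # responds to the image
    [[0, 0], [0, 0]],    # all zeros: would be drawn as a black square
    [[0.5, 0], [0, 0]],  # weak, but still active
]

print(inactive_filters(feature_maps))  # [1]
```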

This post was intended to open the black box of image classification algorithms a little. The visualisations show us, step by step, how a convolutional neural network reads and understands an image. Of course, there is a lot of mathematics involved in the construction of CNNs, which I did not explain in order to stay true to the motivation of this article. The list of references will help if you are interested in the background.

Notebook used for this article

Visualizing filters for CNN using VGG19

Picasso, a free open source visualiser for CNNs 

Deep Visualization Toolbox

The maths behind Activation Functions

*No animals were harmed during the writing of this post.