Convolutional Neural Networks (also called CNNs or convnets) are a type of neural network architecture used to learn from data with a grid-like structure, such as images, audio, and video.
The massive advancement in computer vision today can be largely attributed to the application of Convolutional Neural Networks. We now have phones with face-unlock systems, Google Lens can perform object detection, and Tesla builds self-driving cars.
In this tutorial, we will give a gentle introduction to the CNN architecture and explain why it is used for image data. We will discuss the building blocks of the convnet architecture and then build a CNN that recognizes handwritten digits.
By the end of this tutorial, you will have learned:
- What Images Are Made Of
- Why can’t we use ANN for images?
- The ImageNet Challenge
- The Architecture of Convolutional Neural Networks
- Different Convolution Operations
- Building a Convolutional Neural Network
Let’s begin with how images are represented.
What Images Are Made Of
Images are simply a collection of numbers arranged in a grid: a 2-dimensional grid of pixels. Thus, if you have, say, an image with a resolution of 32 by 32, there are 32 rows and 32 columns of pixel values. This is the case for black and white (grayscale) images.
For colored images, the grid of numbers is split into different color schemes such as RGB, CMYK, etc. In a 32×32 RGB image, for instance, the image contains 3 layers stacked together (the red, green, and blue channels).
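To make this concrete, here is a minimal sketch using NumPy (the random pixel values are placeholders, not real image data) showing the shapes of a grayscale and an RGB image:

```python
import numpy as np

# A 32x32 grayscale image: one 2-D grid of pixel intensities (0-255).
gray = np.random.randint(0, 256, size=(32, 32), dtype=np.uint8)
print(gray.shape)   # (32, 32)

# A 32x32 RGB image: three stacked layers (red, green, blue channels).
rgb = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
print(rgb.shape)    # (32, 32, 3)
```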
Two properties make images unique – locality and translation invariance.
Locality is the property that the same patterns can be observed in different parts of an image. In a picture of a house with 2 identical windows, the same pattern appears at each window. Of course, this is because both windows are made of the same materials and have the same width, height, and other properties.
Translation invariance is the property that even when the appearance of an image varies, it can still be recognized as the same image. Sticking with the house scenario: say you took two different shots of the same house. The first picture is very unlikely to match the second exactly; the angle, contrast, zoom, and so on may differ. Nevertheless, it is still the same house. Translation invariance means that even though the two images differ in these ways, the object remains the same object.
Why can’t we use ANN for images?
We know that an ANN receives numeric inputs and transforms them to produce an output. Since images are 2D grids of numbers, why can't we concatenate the rows into one long row and pass the numbers into an ANN? This looks like a good idea, and in fact it would work. However, the model would not learn the data efficiently. We have mentioned that images have locality, meaning that the same pattern is observed at different pixels. A CNN assigns the same weights wherever these patterns are observed, which speeds up learning.
If the pixels were concatenated into a single row, we would lose these patterns because the spatial arrangement of the pixels has been destroyed. This makes it harder for the model to learn quickly.
Another reason a flattened, pixelated image cannot be passed into an ANN is the second property of images: translation invariance. If the pixels are flattened into one layer, a slight change in the image produces an entirely different tensor. The ANN would then fail to classify the image correctly because of that slight change. Thus the translation invariance property is lost in an ANN.
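A small NumPy sketch makes this concrete (the tiny 5×5 "image" is an illustrative toy, not real data): shifting a pattern by a single pixel changes the flattened vector at several positions, even though the two images look almost identical.

```python
import numpy as np

# A tiny 5x5 "image" with a bright 2x2 patch in the top-left corner.
img = np.zeros((5, 5))
img[0:2, 0:2] = 1.0

# The same patch shifted one pixel to the right: visually almost identical.
shifted = np.zeros((5, 5))
shifted[0:2, 1:3] = 1.0

# After flattening, the two vectors disagree at several positions,
# so a plain ANN sees them as very different inputs.
diff = int(np.sum(img.flatten() != shifted.flatten()))
print(diff)  # 4
```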
The ImageNet Challenge
From 2010 to 2017, there was an annual computer vision challenge in which individuals and teams built models to classify images from the ImageNet dataset. It was this challenge that opened our eyes to the potential of CNNs. The ImageNet dataset contained over 1.4 million images across 1000 classes. The results were presented at an annual computer vision conference showcasing the best techniques and architectures for building computer vision models.
In 2010 and 2011, researchers used classical computer vision techniques to hand-engineer features and build models that classified images according to those patterns. The best error rates were around 27%. But in 2012, Alex Krizhevsky led a team that developed a classifier with tremendous results: a convolutional neural network that achieved an error rate of about 15%. This was far better than any previous result and opened everyone's eyes to the power of CNNs. His network is now famously known as AlexNet.
Following the remarkable results from AlexNet, virtually all entries in 2013 were CNNs. That year, Matthew Zeiler and Rob Fergus made a small variation on the AlexNet model and won the competition. VGGNet won in 2014 with an error rate of about 7%. The following year, Microsoft researcher Kaiming He et al. developed a model that applied ensembles of residual networks to achieve an astonishing error rate of 3.57%. Their network was called ResNet. At this point, the results began to saturate, and by 2017 the competition was discontinued, since the models already produced results exceeding human performance.
The Architecture of Convolutional Neural Networks
A convnets architecture is made of four layers:
- Convolution layer
- Nonlinearity or Activation Layer
- Pooling layer
- Fully connected layers
Let’s explain each of these.
The Convolution layer
The convolution layer is where the feature detection of an image is done. We have said that images have specific patterns. In the convolution layer, these patterns are studied and learned such that they are recognized anywhere else in the picture. But how does the convolution layer do this feature recognition?
A locally connected unit, usually 3 by 3 in structure, called a kernel (or filter), slides across the grid of the image, returning an output at each position. The grid of outputs a filter produces is called a feature map. There can be several feature maps per convolution layer: one filter may detect edges, while others detect shadows, colors, gradients, and so on, until the image is fully described. The kernel itself is a small matrix of weights; at each position, these weights are multiplied by the overlapping pixels and summed to produce the output value.
Remember, the kernel slides across the image pixels, returning an output value at each position. This process continues until the entire image has been covered. The table below shows an example of how each filter performs a particular task and passes its output to the next filter, which builds on it.
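The sliding operation can be sketched in plain NumPy. This is a minimal illustration, not a production implementation: the 3×3 vertical-edge filter below is a hand-picked example kernel, and the 6×6 image is a toy with a bright left half.

```python
import numpy as np

def slide_kernel(image, kernel):
    """Slide a kernel over an image (valid positions only) and
    return the grid of outputs, i.e. one feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the overlapping patch by the kernel weights and sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Illustrative 3x3 vertical-edge filter (a classic hand-crafted kernel).
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

image = np.zeros((6, 6))
image[:, :3] = 1.0                # left half bright, right half dark
feature_map = slide_kernel(image, kernel)
print(feature_map.shape)          # (4, 4): 6 - 3 + 1 in each dimension
```

Positions where the bright and dark halves meet produce strong responses, while uniform regions produce zero, which is exactly how a filter "detects" its pattern.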
Different Convolution Operations
- Valid convolution: In this type of convolution, the kernel slides on the image grid such that it only returns an output where it can fully overlap the image. When using valid convolution, the output size of the image is reduced by the formula:
Output size = Input size – Kernel size + 1
Say you used a 3 by 3 kernel on a 27 by 27 image; the output size after convolution is 27 – 3 + 1 = 25. Therefore, the output image will be 25 by 25.
- Full convolution: In full convolution, an output is computed wherever the kernel and the image overlap by at least one pixel. To achieve this, the image grid edges are padded with zeros before the kernel sweeps from the top-left corner to the bottom-right corner in a left-to-right manner. In this case, the output size is larger than the input size.
Output size = Input size + Kernel size – 1
- Same convolution: In this variation, the image grid is padded with just enough zeros to ensure that the output size equals the input size. This is often the desired behavior, which is why same convolution is a popular variant. It is important to point out that the padding is symmetrical only if your kernel has an odd size, say 3 by 3. If you use an even-sized kernel, say 4 by 4, you will have to pad one side of the image more than the other (asymmetrical padding) to achieve same convolution.
Output size = Input size
- Strided convolution: In this variant, some positions on the image grid are intentionally skipped. In other words, the kernel does not slide over every possible overlap. This method is good for building a light model, as it is computationally less demanding, and it also shrinks the output size.
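The output-size formulas above can be verified with SciPy's `convolve2d`, whose `mode` argument corresponds to the valid, full, and same variants (strided convolution is approximated here by slicing the valid output; the random array values are placeholders):

```python
import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(27, 27)
kernel = np.random.rand(3, 3)

# Valid: output only where the kernel fully overlaps -> 27 - 3 + 1 = 25
valid = convolve2d(image, kernel, mode="valid")
print(valid.shape)    # (25, 25)

# Full: output wherever at least one pixel overlaps -> 27 + 3 - 1 = 29
full = convolve2d(image, kernel, mode="full")
print(full.shape)     # (29, 29)

# Same: zero-padded so the output matches the input size -> 27
same = convolve2d(image, kernel, mode="same")
print(same.shape)     # (27, 27)

# Strided (stride 2): keep every other valid position.
strided = valid[::2, ::2]
print(strided.shape)  # (13, 13)
```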
The Pooling Layer
In pooling, a statistical aggregate of the pixels under each window is computed as the window slides over the image. The most common aggregates are the mean and the maximum value of the window.
Pooling is, in reality, another variant of convolution. It is, however, a very useful layer, which is why it is used alongside the other convolution layers. Just as with strided convolution, pooling reduces the output image size significantly.
The pooling layer is used as a separate layer alongside other convolution layers because it reduces the complexity of the image especially in situations where the image size is large. On top of that, the pooling layer helps to reduce the noise from the convolution layer.
For instance, the convolution layer is meant to learn edges. The edges may not be distinctly clear after the convolution layer. Using a Max Pooling layer after the convolution layer will output the denser grids (the edges) and fade out grids that are less dense (the noise).
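A minimal NumPy sketch of 2×2 max pooling with stride 2 (the input values below are an illustrative toy feature map, not real convolution output):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """2x2 max pooling with stride 2: keep the strongest response
    in each non-overlapping window, fading out weaker (noisy) grids."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size    # trim to a multiple of the window size
    windows = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return windows.max(axis=(1, 3))

fm = np.array([[1, 3, 0, 2],
               [4, 2, 1, 0],
               [0, 1, 5, 6],
               [2, 3, 7, 8]])
pooled = max_pool(fm)
print(pooled)   # [[4 2]
                #  [3 8]]
```

Each 2×2 block collapses to its maximum, halving the size in each dimension while keeping the strongest (densest) responses.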
The Nonlinearity Layer
Since the convolutional layers perform a linear operation, it is vital to introduce nonlinearity into the architecture. This is what the nonlinearity layer does. In practice, activation functions are added on top of the convolutional layer. There are several nonlinear operations to choose from; common ones include the ReLU, sigmoid, tanh, and softmax activation functions. For a detailed discussion of activation functions, refer to the previous tutorial on ANNs.
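As a quick illustration, the ReLU activation simply zeroes out negative responses while passing positive ones through unchanged (a minimal NumPy sketch with made-up input values):

```python
import numpy as np

def relu(x):
    # ReLU keeps positive responses and zeroes out negative ones.
    return np.maximum(0, x)

conv_output = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
activated = relu(conv_output)
print(activated)  # [0.  0.  0.  1.5 3. ]
```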
A Fully Connected layer
After the convolution and pooling layers, it is important to pass the data into a fully connected layer with a number of dense neurons. This is done by flattening the output of the convolution layers and connecting it to a fully connected layer. The fully connected layers add further nonlinearity to the system. As a general rule, more nonlinearity means a more expressive system. The system's accuracy, however, saturates after a certain number of layers; AlexNet, for instance, had just 8 layers.
Apart from adding more nonlinearity to the architecture, the fully connected layers also create the platform to add the final layer that returns the labels.
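Putting the layers together in Keras, a sketch of a convnet's tail might look like the following (the layer sizes here are illustrative choices, not prescribed values): the pooled feature maps are flattened, passed through a dense layer, and a final 10-unit softmax layer returns the class labels.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative convnet: convolution -> pooling -> flatten -> dense -> labels.
model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                  # e.g. a 28x28 grayscale image
    layers.Conv2D(32, (3, 3), activation="relu"),    # convolution + nonlinearity
    layers.MaxPooling2D((2, 2)),                     # pooling layer
    layers.Flatten(),                                # flatten feature maps
    layers.Dense(64, activation="relu"),             # fully connected layer
    layers.Dense(10, activation="softmax"),          # final layer returning labels
])
model.summary()
```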
Building a Convolutional Neural Network
In this section, we shall build a CNN and feed it with data for the purpose of classification. Specifically, the MNIST dataset will be used for this task. MNIST is a popular dataset for building a simple image classifier. It contains 60,000 training instances and 10,000 test instances. The features are pixelated images of handwritten digits from 0 to 9, and the label is the digit written.
Here, we will build an image classifier using a CNN with Keras. The dataset ships with the Keras datasets module. We start by importing the necessary libraries, then load the data, which comes pre-split into training and testing sets.
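A minimal sketch of these first steps, assuming TensorFlow's bundled Keras (the preprocessing choices, scaling and adding a channel dimension, are conventional but our own):

```python
import numpy as np
from tensorflow import keras

# Load MNIST via the Keras datasets API; it comes pre-split into
# 60,000 training and 10,000 test images of 28x28 pixels.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)

# Scale pixel values to [0, 1] and add a channel dimension for the CNN.
x_train = np.expand_dims(x_train.astype("float32") / 255.0, -1)
x_test = np.expand_dims(x_test.astype("float32") / 255.0, -1)
print(x_train.shape)  # (60000, 28, 28, 1)
```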