Understanding Artificial Neural Networks, with Example using Keras.

A major breakthrough in artificial intelligence came with the development of Artificial Neural Networks (ANNs). Traditional machine learning algorithms are great, but they become limited as the data gets larger and more complex. They are not the best methods for learning from big data. With ANNs, we can build models that find hidden patterns in complex and big data. It may interest you to know that many of the complex applications of artificial intelligence in our day-to-day activities make use of Artificial Neural Networks or one of their other forms, such as the Convolutional Neural Network (CNN) or the Recurrent Neural Network (RNN).

In this tutorial, we will talk extensively about the ANN and its architecture. We will also demystify the many fancy words in this field and finally go ahead to build a neural network with Keras. Specifically, here’s what you’ll learn by the end of this tutorial.

What are Artificial Neural Networks (ANNs)
The Architecture of ANN
How the ANN Learns
Key Concepts in Artificial Neural Networks
Activation Functions
Examples of Activation Functions used in Building ANN
Strategies to Deal with Overfitting
Building an Artificial Neural Network Using Keras

Let’s blast off.

What are Artificial Neural Networks (ANNs)

An Artificial Neural Network is a machine learning model, most often trained with supervised learning, in which data is fed into elements called perceptrons (also called neurons, nodes or units) that return an output. A perceptron can be seen as a black box that receives an input, carries out some transformation, and returns an output.

A perceptron

The Architecture of ANN

The architecture of an ANN is inspired by how the human brain is structured. The brain contains billions of neurons, each firing signals to one another. Each brain neuron receives input through its dendrites, processes the input in the cell body, and passes it along to the next neuron through the axon.

Source: Stack Exchange

In the same vein, the perceptrons are usually connected in layers, where each perceptron makes a simple decision and outputs the result to the perceptrons in the next layer. Any layer between the input layer and the output layer is called a hidden layer.

An ANN that has just one hidden layer is called a shallow neural network. On the other hand, an ANN with more than one hidden layer is called a deep neural network. In case you’ve been wondering where the deep in deep learning came from, you have your answer. Most ANNs are built with more than 1 hidden layer, making the network deep. Generally speaking, deeper networks can learn more complex patterns, although they also need more data and are more prone to overfitting.

How the ANN Learns

The ANN is structured like the human brain and thus learns much like humans do. Say you’re about to learn a new concept and, straight out of the gate, you decide to take a test. There’s a very high likelihood that you’d fail such a test. Let’s say you then decided to do some reading up. You picked up a textbook to read on the said concept. You even went a step further to see how 2 worked examples are done. Afterwards, you attempted the same quiz. You would definitely perform better than in your first attempt.

Let’s say you got really adventurous now. You’re still not happy with your outcome, so this time you decided to devour any material whatsoever on that concept. You practised with a lot of solved examples before attempting the quiz. Now, you should perform way better than in the first 2 attempts, and you most likely would. Notice that your result got better as you were exposed to more solved problems and tested yourself. This is exactly how an ANN works.

The network of connected perceptrons requires data (the textbooks), a loss function (the test) and an optimizer (studying more). The network takes the input data and pushes it into a fully connected layer. Each layer processes the data with what is called an activation function. It then pushes the output to the next layer or to the output layer.

Source: Research Gate

On getting to the last layer, the network returns the final result. The network then checks how well it has performed using a defined loss function. It then goes back through each layer to adjust the weights in a bid to reduce the loss. The network determines the next suitable weights using an optimizer. The process of going back through each layer is called backpropagation. It repeats these steps until the loss stops improving. When it arrives at the lowest loss it can reach, the ANN has learned as much as it can from the data.

I know there have been a lot of fancy words, and truthfully each step is more intricate than has been described. Next, we will discuss some of the concepts you have to understand as regards ANNs.

Key Concepts in Artificial Neural Networks

Input data

Data is the crux of any machine learning model. The input data is the data fed into the model for training, with the aim of spitting out some output. Thereafter, fresh input data is fed into the model for testing. The input data can be called features or, in some cases, independent variables. The input data goes into the network following the equation wx + b, where w is the weight, x is the input feature and b is the bias.
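
To make the wx + b expression concrete, here is a minimal sketch for a single neuron that receives several features at once. The function name weighted_input and all the numbers are made up for illustration; they do not come from the dataset used later in this tutorial.

```python
def weighted_input(weights, features, bias):
    """Return w.x + b for one neuron that receives several input features."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias

# Illustrative values only: three features, three weights, one bias.
z = weighted_input(weights=[0.5, -1.0, 2.0], features=[2.0, 1.0, 0.5], bias=0.25)
print(z)  # 0.5*2.0 - 1.0*1.0 + 2.0*0.5 + 0.25 = 1.25
```

With more than one feature, each feature simply gets its own weight, and the neuron adds the single bias b at the end.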

As explained above, the input data is usually split into two, training and testing. The training data usually contains the output data while the test data do not. Let’s discuss the output data next.

Output data

This is the result returned by the model. During the training process, the output data (otherwise called the target or the dependent variable) is fed alongside the input data. The model thus learns how the features combine to produce the output. During the testing stage, the output data is hidden from the model. This time, it is expected that the model makes predictions of the output data. The output data could be a set of real numbers or booleans.

Perceptron

This is the fundamental unit of an ANN. The perceptron (also called the neuron or node) receives some input and returns some output. It utilizes the activation function when processing the inputs to outputs. There is a myriad of activation functions. Examples include sigmoid, relu, tanh etc. All these will be discussed later in this tutorial.

Weights

A perceptron basically transforms an input into output by multiplying the input data by weights. Weights are simply numbers that affect the outputs of each perceptron. During the process of improving a model’s performance, the model basically finds the best combination of weights for each perceptron that returns a lower loss.

Feedforward pass

This is the process of the model taking the input data and passing it through each perceptron until it returns an output to the user. The perceptron receives the inputs, processes them through an activation function, and returns an output. The output goes into the next layer of the neural network architecture, where it becomes the input data for that layer. This process continues until it gets to the final layer, called the output layer. At this point, the network generates an output for the user.
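
The feedforward pass described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the tiny network (2 inputs, 2 hidden neurons, 1 output neuron), the helper names dense and feedforward, and every weight value are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each neuron computes activation(w.x + b)."""
    return [
        activation(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
        for neuron_w, b in zip(weights, biases)
    ]

def feedforward(x):
    # Hidden layer: 2 neurons, each with 2 weights (made-up values).
    hidden = dense(x, weights=[[1.0, -1.0], [0.5, 0.5]], biases=[0.0, -1.0],
                   activation=relu)
    # Output layer: 1 neuron that reads the 2 hidden outputs.
    return dense(hidden, weights=[[1.0, 1.0]], biases=[0.0], activation=sigmoid)

output = feedforward([2.0, 1.0])
print(output)
```

Notice how the output of the hidden layer is passed on as the input of the output layer, which is all a feedforward pass really is.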

Loss function

After a feed-forward pass is complete, the model compares its output to the actual output. The loss function (also called the error function) measures how far the predicted value is from the actual value.
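
As an illustration, here is a minimal sketch of one common loss function, binary cross-entropy, which is also the loss used in the Keras model later in this tutorial. The labels and predictions are made up, and the eps clipping value is an assumption chosen only to avoid taking the log of zero.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
good_predictions = [0.9, 0.1, 0.8, 0.2]  # close to the labels
bad_predictions = [0.2, 0.9, 0.3, 0.7]   # far from the labels

good_loss = binary_cross_entropy(labels, good_predictions)
bad_loss = binary_cross_entropy(labels, bad_predictions)
print(good_loss, bad_loss)
```

The predictions that sit closer to the labels give the smaller loss, which is exactly the signal the network uses to judge itself.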

Optimizer

The optimizer’s job is basically to adjust the weights of the network so that the loss decreases. There are a couple of optimizers out there. Examples include the very popular Adam optimizer, Adagrad, the momentum optimizer, etc.

Backpropagation

A neural network’s aim is to ensure its predicted value and the actual value are as close as possible. After the first feed-forward pass, this is not usually the case. In order to improve its result, the model must adjust its weights. This is done during backward propagation. Backpropagation visits each neuron and computes the derivative of the loss function with respect to its weights. This helps to find the weights that would decrease the loss function. This process of optimizing the weights is mathematically called gradient descent. The process of moving back through each neuron is called backpropagation.
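
Here is a minimal sketch of gradient descent on a single sigmoid neuron with a binary cross-entropy loss. It is deliberately simpler than real backpropagation, which applies the same idea layer by layer using the chain rule. The toy data, the learning rate of 0.5 and the 100 steps are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, data):
    """Mean binary cross-entropy of a single sigmoid neuron."""
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(data)

def gradient_step(w, b, data, learning_rate):
    """One gradient-descent update using the derivatives of the loss."""
    grad_w = grad_b = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        # For sigmoid + cross-entropy, dLoss/dz simplifies to (p - y).
        grad_w += (p - y) * x / len(data)
        grad_b += (p - y) / len(data)
    # Move each parameter a small step against its gradient.
    return w - learning_rate * grad_w, b - learning_rate * grad_b

# Toy data: negative inputs belong to class 0, positive inputs to class 1.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = 0.0, 0.0
initial_loss = loss(w, b, data)
for _ in range(100):
    w, b = gradient_step(w, b, data, learning_rate=0.5)
final_loss = loss(w, b, data)
print(initial_loss, final_loss)
```

Each step nudges the weight in the direction that lowers the loss, so the final loss ends up below the initial one.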

Epoch

An epoch is the term used to indicate one complete pass of the whole training data through the network, feed-forward and backpropagation included. You may also see the number of epochs as the number of times the model sees your data to adjust its weights. Generally, the model learns more as the number of epochs increases, up to a particular threshold. Thereafter the performance of the model stabilizes.
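
In practice, the data is usually fed to the network in small batches, so one epoch involves several weight updates. The sketch below works out the arithmetic, assuming 4000 training samples and 200 epochs (the numbers used later in this tutorial) and a batch size of 32, which is the default that Keras documents for fit when no batch size is given. The function name updates_per_epoch is made up.

```python
import math

def updates_per_epoch(num_samples, batch_size):
    """Number of batches (weight updates) needed to see every sample once."""
    return math.ceil(num_samples / batch_size)

steps = updates_per_epoch(num_samples=4000, batch_size=32)
total_updates = steps * 200  # 200 epochs
print(steps, total_updates)  # 125 25000
```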

Bias

Bias is one term you’d frequently hear. (It should not be confused with the bias term b in wx + b.) The bias describes how far the model’s predictions are from the actual results on the data it was trained on. During training, the model compares its predicted result with the actual result. The model is said to have low bias if the predicted result is close to the actual result. The bias is high if otherwise.

Variance

After the training process, it is important to check your model’s performance on completely unseen data. The variance describes how much the model’s performance changes when it is fed data that was not in the training data. The model has low variance if it performs about as well on unseen data as it did during training. The variance is high if its performance on unseen data is much worse.

Underfitting

A model is said to underfit if it has high bias. In other words, the model doesn’t perform well during training and, by extension, doesn’t perform well on unknown data. This is a serious problem in machine learning. It means your model does not learn. If faced with this, you may want to tweak your model or perhaps add more training data if the input data was small.

Overfitting

Overfitting occurs when the model has low bias and high variance. Putting it differently, the model performs impressively well during training but fails woefully when it is greeted with unknown data. Technically speaking, we say the model does not generalize well. During overfitting, the model has gone beyond learning patterns to learning your training data verbatim, and that includes the noise in the data. You can see it as a student cramming a mathematical solution from a textbook and thinking he understands the concept. If such a student is confronted with a slightly different problem, he will most likely fail.

Overfitting is a much more common situation than its underfitting counterpart. Later in this tutorial, we will touch on ways to lessen the occurrence of overfitting in our model.

Hyperparameters

Hyperparameters can be seen as settings that are changed to affect the behavior of your models. It is good practice to tune hyperparameters to suit the data you’re working with at the time. Examples of hyperparameters to tune include the kind of activation function to use, the number of perceptrons in a layer, the number of hidden layers, the number of epochs to train for, etc.

Now let’s take a deeper dive into activation functions.

Activation Functions

Activation functions are pivotal in neural networks. They determine the output returned from each neuron, whether a network can be trained to converge after a number of epochs, and the computational efficiency in general. Activation functions can be seen as mathematical functions that map inputs to outputs. Many of them return small numbers, typically between 0 and 1 or between -1 and 1.

Why Do We Need Activation Functions

You may be wondering why exactly you need activation functions. Activation functions are important for two major reasons.

Faster computation: Activation functions first help us restrict the output from a neuron. For a feature x, the input value into the neuron is given by w*x + b, where w is the weight and b is the bias. If the value of x is high, the value of w*x + b will likewise be high. Imagine we have tens or hundreds of neurons across different layers connected together and dealing with such high numbers. The values could grow very large after just one feed-forward step. The small numbers output by squashing activation functions such as sigmoid and tanh keep the computation manageable even with millions of parameters involved.
Permits non-linear transformation: This is perhaps the most important reason for an activation function. It allows you to map the input to the output in a non-linear way. Just as we discussed in the tutorial about Kernels, most real-life situations are non-linearly connected. Life would have been easier if you could predict whether someone has malaria by adding his age to the number of siblings he has. This is however not the case. The independent features are connected to the dependent feature in a non-linear manner. We thus need to find ways of mapping input to output in a non-linear manner. That is what an activation function does. Moreover, the many connected neurons make the non-linear mapping even more powerful. A neuron builds upon the output of the previous neuron, allowing for a higher degree of complexity.
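
The second reason can be checked with a small experiment. In the sketch below, two neurons stacked without an activation function behave exactly like a single neuron, while putting a ReLU between them produces a curve that no single straight line can reproduce. The function names and all the weight values are made up for illustration.

```python
def relu(z):
    return max(0.0, z)

def two_linear_layers(x, w1=2.0, b1=1.0, w2=-3.0, b2=0.5):
    """Two stacked neurons with no activation function."""
    return w2 * (w1 * x + b1) + b2

def single_linear_layer(x, w=-6.0, b=-2.5):
    """One neuron with w = w2*w1 and b = w2*b1 + b2."""
    return w * x + b

def two_layers_with_relu(x, w1=2.0, b1=1.0, w2=-3.0, b2=0.5):
    """Same weights, but with a ReLU between the two neurons."""
    return w2 * relu(w1 * x + b1) + b2

inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear_outputs = [two_linear_layers(x) for x in inputs]
collapsed_outputs = [single_linear_layer(x) for x in inputs]
relu_outputs = [two_layers_with_relu(x) for x in inputs]
print(linear_outputs)     # identical to collapsed_outputs
print(collapsed_outputs)
print(relu_outputs)       # flat for negative inputs, then sloping: not a straight line
```

Without an activation function, adding more layers adds nothing, because the whole stack collapses into one linear equation.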

Let’s briefly touch on some of the most common activation functions used in ANNs.

Activation Functions used in Building ANN

Sigmoid function: The sigmoid function, also called the logistic function, is an S-shaped curve that is mostly used for binary classification. The curve of the function is shown below.

Source: Wikipedia

As seen in the above curve, it returns a value between 0 and 1. Typically, values above 0.5 belong to the first class while values less than 0.5 belong to the other class.
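
A minimal sketch of the sigmoid function and the 0.5 threshold is shown below. The helper name predict_class is made up, and treating an output of exactly 0.5 as the second class is an arbitrary choice made only for this example.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    """Return 1 if the sigmoid output is above the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0

for z in [-4.0, 0.0, 4.0]:
    print(z, round(sigmoid(z), 4), predict_class(z))
```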

The sigmoid function however has some drawbacks. First, it is computationally demanding, since it involves computing an exponential. It also leads to the vanishing gradient problem. If you don’t know what the vanishing gradient problem is, let me explain it in a few lines.

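The vanishing gradient problem occurs when the derivatives computed during backpropagation become very small. The sigmoid curve flattens out for large positive or negative inputs, so its derivative there tends to zero, and these small derivatives are multiplied together as the error is passed back through the layers. When this continues over some epochs, the weights of the affected neurons barely change, and learning effectively stops.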
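
The effect is easy to see numerically. The sketch below computes the derivative of the sigmoid function and then multiplies its best-case value across 10 layers. It ignores the weights entirely, which is a simplifying assumption made only to isolate the contribution of the activation function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at 0.25 (at z = 0) and shrinks quickly as |z| grows.
print(sigmoid_derivative(0.0), sigmoid_derivative(5.0))

# Backpropagation multiplies one such factor per layer (weights ignored here),
# so even the best case shrinks geometrically with depth.
best_case_factor = sigmoid_derivative(0.0) ** 10
print(best_case_factor)
```

Even in the best case, ten sigmoid layers scale the gradient down to less than one millionth of its size.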

Another drawback of the sigmoid function is that it is not centered at zero. This is not desirable as it can quickly shift the input distribution to one direction.

Softmax function: This is seen as a more general form of the sigmoid function. It is used for multiclass classification problems, where it turns the outputs into probabilities that sum to 1. Just as with the sigmoid function, the softmax function can also lead to vanishing gradient problems. It is however useful in the final layer of classification problems.
Tanh function: The hyperbolic tangent (tanh) behaves like the sigmoid function, but its output ranges from -1 to 1, which solves the zero-centered problem.

Source: Wikipedia
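
The softmax and tanh functions can be sketched as follows. The scores are made up, and subtracting the maximum score before exponentiating is a common numerical-stability trick rather than part of the definition.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract the max for stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)

# tanh is zero-centered: tanh(0) = 0 and outputs lie between -1 and 1.
print(math.tanh(-2.0), math.tanh(0.0), math.tanh(2.0))

# With two classes, softmax over [z, 0] gives the sigmoid of z.
z = 1.5
print(softmax([z, 0.0])[0], 1.0 / (1.0 + math.exp(-z)))
```

The last line shows in what sense softmax generalizes the sigmoid: with only two classes, the two functions give the same probability.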

Rectified Linear Unit (ReLU): This is perhaps the most common activation function out there. It is defined by f(x) = max(0, x). In plain words, the ReLU changes all negative inputs to 0 and leaves positive numbers as they are. The ReLU activation function is computationally efficient and, for positive inputs, does not suffer from the vanishing gradient problem. The ReLU activation function however has its demerits. Because ReLU converts all negative inputs to zero, a neuron whose input stays negative keeps outputting zero, receives zero gradient, and can stop learning entirely during training. This is called the dying ReLU problem.
Leaky ReLU: The Leaky ReLU activation function helps to solve the dying ReLU problem. The function is given by f(x) = max(αx, x), where α is a small number, say, 0.01. With leaky ReLU, neurons are not shut down completely.
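
Both functions are short enough to write by hand. The sketch below uses α = 0.01 as in the text; the input values are made up.

```python
def relu(x):
    """max(0, x): negative inputs become 0, positive inputs pass through."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """max(alpha * x, x): negative inputs are scaled down instead of zeroed."""
    return max(alpha * x, x)

for x in [-5.0, -1.0, 0.0, 3.0]:
    print(x, relu(x), leaky_relu(x))
```

For a negative input such as -5, ReLU returns 0 while leaky ReLU returns a small negative number, so some signal still flows through the neuron.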

Before we delve into building our neural network, it is important to discuss how to deal with one major problem when building an ANN or any neural network architecture – overfitting.

Strategies to Deal with Overfitting

As discussed earlier, overfitting occurs when a model achieves a small loss on the training data but a much higher loss on new data. When this happens, we say the model has a generalization problem. In practice, such models cannot be used. How then do you reduce overfitting? The process of reducing the possibility of overfitting is called regularization. Let’s discuss some of the strategies that can be used.

The size of the network: Yes, deeper is better. However, if you increase the number of layers to some point, the network begins to learn every detail of the data, including noise. You should try to use the optimum number of layers and by extension the number of nodes in each layer.
Dropout: This technique can come in handy when you have a lot of layers and units. Dropout involves shutting down some nodes randomly during training by setting their outputs to 0. When building the model, a parameter called the dropout rate is defined to indicate the fraction of nodes you want dropped out.
Weight regularization: This can be seen as a limitation placed on a network such that it does not have weights of very high values. The constraint is added to the cost function (C.F), and it comes in two types. In both, w ranges over the weights of the network and α controls how strong the penalty is.
- Ridge (or L2 regularizer): This adds the sum of the squares of the weight values. The new cost function becomes

C.F(new) = C.F + α Σ w²

- Lasso (or L1 regularizer): This adds the sum of the absolute values of the weights. The new cost function becomes

C.F(new) = C.F + α Σ |w|
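
Dropout and the two weight penalties can be sketched in plain Python. The function names, the weights, the base loss of 0.40 and α = 0.01 are all made up for illustration. The dropout function uses the inverted-dropout variant, which scales the surviving values by 1 / (1 - rate); the Keras documentation describes its Dropout layer as rescaling in the same way.

```python
import random

def dropout(activations, rate, rng):
    """Inverted dropout: zero each activation with probability `rate` and
    scale the survivors by 1 / (1 - rate) so the expected sum is unchanged."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

def absolute_penalty(weights, alpha):
    """alpha times the sum of the absolute values of the weights."""
    return alpha * sum(abs(w) for w in weights)

def squared_penalty(weights, alpha):
    """alpha times the sum of the squared weights."""
    return alpha * sum(w * w for w in weights)

rng = random.Random(0)  # fixed seed so the run is repeatable
dropped = dropout([1.0] * 10, rate=0.3, rng=rng)
print(dropped)

weights = [0.5, -2.0, 3.0]
base_loss = 0.40  # hypothetical loss before any penalty is added
print(base_loss + absolute_penalty(weights, alpha=0.01))
print(base_loss + squared_penalty(weights, alpha=0.01))
```

The squared penalty punishes the large weight (3.0) much more heavily than the absolute penalty does, which is why it pushes the network towards many small weights.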

Armed with all this knowledge, let’s now build an ANN using Keras.

Building an Artificial Neural Network Using Keras

In this section, we will develop an ANN model that predicts whether a customer should be given a credit card or not. The dataset, obtained from the UCI Machine Learning Repository, can be downloaded here. The process of building the model will be divided into 5 steps.

Step 1: Import the Data

Step 2: Exploratory Data Analysis

Step 3: Feature Engineering

Step 4: Model Building

Step 5: Training and Evaluating the Model

Step 1: Import the Data

Of course, we will need to import the data. We do this using the pandas library. Since we will also be making use of other libraries later on, let’s import all the necessary libraries right now.

#import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

#import the data
df = pd.read_csv("Bank_Personal_Loan_Modelling.csv")

Step 2: Exploratory Data Analysis

Let’s see what the data looks like. We print the first 5 rows.

#print the first 5 rows of the dataframe
df.head()

Output:

  ID  Age  Experience  Income  ZIP Code  Family  CCAvg  Education  Mortgage  \
0   1   25           1      49     91107       4    1.6          1         0   
1   2   45          19      34     90089       3    1.5          1         0   
2   3   39          15      11     94720       1    1.0          1         0   
3   4   35           9     100     94112       1    2.7          2         0   
4   5   35           8      45     91330       4    1.0          2         0   

   Personal Loan  Securities Account  CD Account  Online  CreditCard  
0              0                   1           0       0           0  
1              0                   1           0       0           0  
2              0                   0           0       0           0  
3              0                   0           0       0           0  
4              0                   0           0       0           1

It’s good practice to check the statistical properties of each column using the describe method.

#check the descriptive statistics of the features
print(df.describe())

Output:

               ID          Age   Experience       Income      ZIP Code  \
count  5000.000000  5000.000000  5000.000000  5000.000000   5000.000000   
mean   2500.500000    45.338400    20.104600    73.774200  93152.503000   
std    1443.520003    11.463166    11.467954    46.033729   2121.852197   
min       1.000000    23.000000    -3.000000     8.000000   9307.000000   
25%    1250.750000    35.000000    10.000000    39.000000  91911.000000   
50%    2500.500000    45.000000    20.000000    64.000000  93437.000000   
75%    3750.250000    55.000000    30.000000    98.000000  94608.000000   
max    5000.000000    67.000000    43.000000   224.000000  96651.000000   

            Family        CCAvg    Education     Mortgage  Personal Loan  \
count  5000.000000  5000.000000  5000.000000  5000.000000    5000.000000   
mean      2.396400     1.937938     1.881000    56.498800       0.096000   
std       1.147663     1.747659     0.839869   101.713802       0.294621   
min       1.000000     0.000000     1.000000     0.000000       0.000000   
25%       1.000000     0.700000     1.000000     0.000000       0.000000   
50%       2.000000     1.500000     2.000000     0.000000       0.000000   
75%       3.000000     2.500000     3.000000   101.000000       0.000000   
max       4.000000    10.000000     3.000000   635.000000       1.000000   

       Securities Account  CD Account       Online   CreditCard  
count         5000.000000  5000.00000  5000.000000  5000.000000  
mean             0.104400     0.06040     0.596800     0.294000  
std              0.305809     0.23825     0.490589     0.455637  
min              0.000000     0.00000     0.000000     0.000000  
25%              0.000000     0.00000     0.000000     0.000000  
50%              0.000000     0.00000     1.000000     0.000000  
75%              0.000000     0.00000     1.000000     1.000000  
max              1.000000     1.00000     1.000000     1.000000

You can quickly see some interesting information. The average age is about 45 years, with an average experience of about 20 years and an average income of about 73.8k USD. The maximum income was 224k USD while the lowest was 8k USD.

Now, let’s see if the data contains null values.

#check for null values 
df.isnull().sum()

Output:

ID                    0
Age                   0
Experience            0
Income                0
ZIP Code              0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal Loan         0
Securities Account    0
CD Account            0
Online                0
CreditCard            0
dtype: int64

As seen, no feature contains missing values, which is good.

Step 3: Feature Engineering

Now we split the data into target and features. The ID column is dropped because it is unique to each sample and so carries no predictive information. Afterwards, the label is encoded using the LabelEncoder() class to return binary output, 0 and 1. Furthermore, it is imperative to split the data into train and test data, here with a test size of 0.2. Finally, the features need to be standardized so that features with large values (such as Income or Mortgage) do not dominate those with small values. We standardize the data using the StandardScaler() class of sklearn. Note that the scaler is fitted on the training data only and then applied to the test data.

#split the data into target and features
target = df.CreditCard 
features = df.drop(['ID', 'CreditCard'], axis=1)

#encode the dependent variable (label)
encoder = LabelEncoder()
target = encoder.fit_transform(target)

#split the data into train and test data
X_train, X_test, y_train, y_test = train_test_split(features,
                                                   target, test_size=0.2, random_state=42)

#standardize the independent features
scaler= StandardScaler()  
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
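
To see why the scaler is fitted on the training data and only applied to the test data, here is a small plain-Python sketch of the same idea. The income values are made up, the helper names fit_scaler and transform are hypothetical, and the population standard deviation is used because, to my knowledge, that is what StandardScaler computes.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and (population) standard deviation of one feature."""
    return statistics.mean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Standardize values using statistics that were learned elsewhere."""
    return [(value - mean) / std for value in column]

train_income = [20.0, 40.0, 60.0, 80.0]  # made-up training values
test_income = [50.0, 100.0]              # made-up test values

# The statistics come from the training data only.
mean, std = fit_scaler(train_income)
train_scaled = transform(train_income, mean, std)
test_scaled = transform(test_income, mean, std)
print(train_scaled)
print(test_scaled)
```

Reusing the training mean and standard deviation keeps information about the test data from leaking into the model, which is why the code above calls fit_transform on X_train but only transform on X_test.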

Step 4: Model Building

Keras provides a fast and efficient way of building ANN architectures. It is important to point out that the best architecture cannot necessarily be known in advance. Various architectures should be tested and the best selected. Here, we will build an ANN with 2 hidden layers. The first hidden layer would have 12 nodes (or neurons) with a relu activation function. The second hidden layer would have 5 nodes with the relu activation function as well. The final layer of course is the output layer with just one node.

The model was compiled with the popular Adam optimizer and a binary cross entropy loss was used since it is a binary classification problem. Also, we tracked the accuracy metric during training. The following code builds the ANN architecture as explained above.

#build the ANN architecture
model = Sequential()
model.add(Dense(12, input_dim=12, activation='relu'))
model.add(Dense(5, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))

#compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Notice that input_dim was set to 12. This is the number of nodes the input layer would have. Since the data has 12 input features (the 14 columns minus ID and CreditCard), the input_dim argument should be 12 so that each feature goes to one node. Also, notice that a dropout layer was added after the 2nd hidden layer. As explained earlier in this tutorial, this technique helps to reduce overfitting.

If you wish to see a text summary of the model, with each layer’s output shape and number of parameters, you can use the summary() method.

#check the model architecture
model.summary()

Output:

Layer (type)                 Output Shape              Param #   
=================================================================
dense_34 (Dense)             (None, 12)                156       
_________________________________________________________________
dense_35 (Dense)             (None, 5)                 65        
_________________________________________________________________
dropout_9 (Dropout)          (None, 5)                 0         
_________________________________________________________________
dense_36 (Dense)             (None, 1)                 6         
=================================================================
Total params: 227
Trainable params: 227
Non-trainable params: 0
_________________________________________________________________
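
The parameter counts in the summary can be verified by hand: each node in a Dense layer has one weight per input plus one bias. A minimal sketch, with a made-up helper name dense_params:

```python
def dense_params(num_inputs, num_units):
    """Each unit has one weight per input plus one bias term."""
    return (num_inputs + 1) * num_units

# 12 input features, then layers of 12, 5 and 1 nodes, as in the model above.
layer_sizes = [12, 12, 5, 1]
params = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(params, sum(params))  # [156, 65, 6] 227
```

The dropout layer contributes no parameters because it has nothing to learn; it only switches outputs off during training.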

Step 5: Training and Evaluating the Model

After building the model architecture, we need to train the model on the data. This is done by calling the fit method and passing the training dataset as parameters. The test data was used as the validation data and the training was done for 200 epochs.

#train the model
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=200)

Output:

Train on 4000 samples, validate on 1000 samples
Epoch 1/200
4000/4000 [==============================] - 1s 287us/sample - loss: 0.7470 - acc: 0.5378 - val_loss: 0.6257 - val_acc: 0.6850
Epoch 2/200
4000/4000 [==============================] - 0s 91us/sample - loss: 0.6329 - acc: 0.6670 - val_loss: 0.5935 - val_acc: 0.7390
Epoch 3/200
4000/4000 [==============================] - 0s 85us/sample - loss: 0.6094 - acc: 0.7122 - val_loss: 0.5756 - val_acc: 0.7460
Epoch 4/200
4000/4000 [==============================] - 0s 88us/sample - loss: 0.5906 - acc: 0.7270 - val_loss: 0.5668 - val_acc: 0.7490
Epoch 5/200
4000/4000 [==============================] - 0s 97us/sample - loss: 0.5790 - acc: 0.7305 - val_loss: 0.5619 - val_acc: 0.7480
Epoch 6/200
4000/4000 [==============================] - 0s 95us/sample - loss: 0.5861 - acc: 0.7340 - val_loss: 0.5604 - val_acc: 0.7500
Epoch 7/200
4000/4000 [==============================] - 0s 92us/sample - loss: 0.5773 - acc: 0.7343 - val_loss: 0.5574 - val_acc: 0.7510
Epoch 8/200
4000/4000 [==============================] - 0s 96us/sample - loss: 0.5761 - acc: 0.7368 - val_loss: 0.5557 - val_acc: 0.7500
Epoch 9/200
4000/4000 [==============================] - 0s 93us/sample - loss: 0.5680 - acc: 0.7395 - val_loss: 0.5531 - val_acc: 0.7510
Epoch 10/200
4000/4000 [==============================] - 0s 90us/sample - loss: 0.5658 - acc: 0.7393 - val_loss: 0.5521 - val_acc: 0.7510
Epoch 11/200
4000/4000 [==============================] - 0s 87us/sample - loss: 0.5676 - acc: 0.7393 - val_loss: 0.5516 - val_acc: 0.7510
Epoch 12/200
4000/4000 [==============================] - 0s 94us/sample - loss: 0.5640 - acc: 0.7375 - val_loss: 0.5514 - val_acc: 0.7510
Epoch 13/200
4000/4000 [==============================] - 0s 91us/sample - loss: 0.5591 - acc: 0.7412 - val_loss: 0.5494 - val_acc: 0.7510
Epoch 14/200
4000/4000 [==============================] - 0s 85us/sample - loss: 0.5610 - acc: 0.7385 - val_loss: 0.5485 - val_acc: 0.7510
Epoch 15/200
4000/4000 [==============================] - 0s 91us/sample - loss: 0.5610 - acc: 0.7410 - val_loss: 0.5490 - val_acc: 0.7510
Epoch 16/200
4000/4000 [==============================] - 0s 86us/sample - loss: 0.5575 - acc: 0.7400 - val_loss: 0.5485 - val_acc: 0.7510
Epoch 17/200
4000/4000 [==============================] - 0s 88us/sample - loss: 0.5558 - acc: 0.7395 - val_loss: 0.5472 - val_acc: 0.7510
Epoch 18/200
4000/4000 [==============================] - 0s 86us/sample - loss: 0.5581 - acc: 0.7400 - val_loss: 0.5464 - val_acc: 0.7510
Epoch 19/200
4000/4000 [==============================] - 0s 94us/sample - loss: 0.5578 - acc: 0.7408 - val_loss: 0.5469 - val_acc: 0.7510
Epoch 20/200
4000/4000 [==============================] - 0s 84us/sample - loss: 0.5560 - acc: 0.7400 - val_loss: 0.5455 - val_acc: 0.7510
Epoch 21/200
4000/4000 [==============================] - 0s 88us/sample - loss: 0.5529 - acc: 0.7408 - val_loss: 0.5454 - val_acc: 0.7510
Epoch 22/200
4000/4000 [==============================] - 0s 91us/sample - loss: 0.5552 - acc: 0.7402 - val_loss: 0.5460 - val_acc: 0.7510
Epoch 23/200
4000/4000 [==============================] - 0s 85us/sample - loss: 0.5524 - acc: 0.7410 - val_loss: 0.5459 - val_acc: 0.7510
Epoch 24/200
4000/4000 [==============================] - 0s 106us/sample - loss: 0.5541 - acc: 0.7418 - val_loss: 0.5449 - val_acc: 0.7510
Epoch 25/200
4000/4000 [==============================] - 0s 120us/sample - loss: 0.5512 - acc: 0.7412 - val_loss: 0.5447 - val_acc: 0.7510
…
Epoch 176/200
4000/4000 [==============================] - 0s 87us/sample - loss: 0.5271 - acc: 0.7450 - val_loss: 0.5411 - val_acc: 0.7480
Epoch 177/200
4000/4000 [==============================] - 0s 87us/sample - loss: 0.5298 - acc: 0.7442 - val_loss: 0.5388 - val_acc: 0.7480
Epoch 178/200
4000/4000 [==============================] - 0s 92us/sample - loss: 0.5280 - acc: 0.7458 - val_loss: 0.5382 - val_acc: 0.7470
Epoch 179/200
4000/4000 [==============================] - 0s 91us/sample - loss: 0.5281 - acc: 0.7435 - val_loss: 0.5400 - val_acc: 0.7460
Epoch 180/200
4000/4000 [==============================] - 0s 89us/sample - loss: 0.5268 - acc: 0.7450 - val_loss: 0.5424 - val_acc: 0.7460
Epoch 181/200
4000/4000 [==============================] - 0s 90us/sample - loss: 0.5293 - acc: 0.7455 - val_loss: 0.5401 - val_acc: 0.7470
Epoch 182/200
4000/4000 [==============================] - 0s 96us/sample - loss: 0.5256 - acc: 0.7452 - val_loss: 0.5381 - val_acc: 0.7470
Epoch 183/200
4000/4000 [==============================] - 0s 92us/sample - loss: 0.5289 - acc: 0.7462 - val_loss: 0.5399 - val_acc: 0.7480
Epoch 184/200
4000/4000 [==============================] - 0s 87us/sample - loss: 0.5288 - acc: 0.7442 - val_loss: 0.5395 - val_acc: 0.7480
Epoch 185/200
4000/4000 [==============================] - 0s 99us/sample - loss: 0.5260 - acc: 0.7477 - val_loss: 0.5381 - val_acc: 0.7470
Epoch 186/200
4000/4000 [==============================] - 0s 86us/sample - loss: 0.5314 - acc: 0.7448 - val_loss: 0.5375 - val_acc: 0.7470
Epoch 187/200
4000/4000 [==============================] - 0s 109us/sample - loss: 0.5295 - acc: 0.7437 - val_loss: 0.5396 - val_acc: 0.7460
Epoch 188/200
4000/4000 [==============================] - 0s 95us/sample - loss: 0.5297 - acc: 0.7433 - val_loss: 0.5391 - val_acc: 0.7470
Epoch 189/200
4000/4000 [==============================] - 0s 96us/sample - loss: 0.5311 - acc: 0.7427 - val_loss: 0.5391 - val_acc: 0.7450
Epoch 190/200
4000/4000 [==============================] - 0s 95us/sample - loss: 0.5305 - acc: 0.7440 - val_loss: 0.5382 - val_acc: 0.7460
Epoch 191/200
4000/4000 [==============================] - 0s 92us/sample - loss: 0.5285 - acc: 0.7462 - val_loss: 0.5372 - val_acc: 0.7460
Epoch 192/200
4000/4000 [==============================] - 0s 91us/sample - loss: 0.5290 - acc: 0.7445 - val_loss: 0.5381 - val_acc: 0.7490
Epoch 193/200
4000/4000 [==============================] - 0s 87us/sample - loss: 0.5311 - acc: 0.7440 - val_loss: 0.5378 - val_acc: 0.7470
Epoch 194/200
4000/4000 [==============================] - 0s 87us/sample - loss: 0.5271 - acc: 0.7440 - val_loss: 0.5407 - val_acc: 0.7490
Epoch 195/200
4000/4000 [==============================] - 0s 95us/sample - loss: 0.5291 - acc: 0.7467 - val_loss: 0.5376 - val_acc: 0.7470
Epoch 196/200
4000/4000 [==============================] - 0s 94us/sample - loss: 0.5289 - acc: 0.7430 - val_loss: 0.5364 - val_acc: 0.7470
Epoch 197/200
4000/4000 [==============================] - 0s 94us/sample - loss: 0.5309 - acc: 0.7437 - val_loss: 0.5399 - val_acc: 0.7490
Epoch 198/200
4000/4000 [==============================] - 0s 85us/sample - loss: 0.5291 - acc: 0.7435 - val_loss: 0.5383 - val_acc: 0.7470
Epoch 199/200
4000/4000 [==============================] - 0s 94us/sample - loss: 0.5275 - acc: 0.7462 - val_loss: 0.5371 - val_acc: 0.7490
Epoch 200/200
4000/4000 [==============================] - 0s 87us/sample - loss: 0.5288 - acc: 0.7452 - val_loss: 0.5392 - val_acc: 0.7470

We can plot a graph that shows how the loss progressed after every epoch. This is done using the history attribute of the object returned by the fit method.

#plot the loss vs validation loss graph
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.legend()

Output:

As seen, the validation loss and the training loss both fall during training and end at similar values (about 0.54 and 0.53). What’s more interesting is that the two losses stay close to each other. This is an indication that the model is not overfitting.

We can also plot the graph that shows the accuracy and validation accuracy per epoch.

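#plot the accuracy vs the validation accuracy graph
#(older TensorFlow versions name the key 'acc'; newer ones use 'accuracy')
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
plt.plot(history.history[acc_key], label='acc')
plt.plot(history.history['val_' + acc_key], label='val_acc')
plt.legend()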

Output:

The model reached an accuracy of about 75 percent, which is fairly good. The accuracy can be increased further by experimenting with the ANN architecture and doing some more feature engineering. Also notice that the validation accuracy and the training accuracy are close together, an indication that our model does not overfit.
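
To put the accuracy figure in context, it helps to compare it with a naive baseline that always predicts the most common class. The sketch below uses two numbers from the outputs above: the mean of the CreditCard column (0.294) and the final val_acc (0.7470). Note that the baseline is computed on the whole dataset, so the value for the 1000 test samples may differ slightly.

```python
positive_rate = 0.294        # mean of the CreditCard column from df.describe()
baseline_accuracy = max(positive_rate, 1.0 - positive_rate)
final_val_accuracy = 0.7470  # val_acc at epoch 200 in the training log
improvement = final_val_accuracy - baseline_accuracy
print(round(baseline_accuracy, 3), round(improvement, 3))
```

The model beats the always-predict-0 baseline by roughly four percentage points, which leaves room for the architecture and feature engineering experiments suggested above.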