Linear Regression with Keras on Tensorflow

In the last tutorial, we introduced the concept of linear regression with Keras and showed how to build a linear regression model using TensorFlow’s estimator API. In that tutorial, however, we skipped a step that is vital for real-life problems. Building any machine learning model requires you to preprocess the data before feeding it to the algorithm or neural network architecture, because raw data may contain missing values, duplicate values, unreasonable entries, or redundant features. These anomalies can greatly affect the performance of your model. Data preprocessing typically involves data cleaning, data augmentation, exploratory data analysis, standardization, normalization, feature extraction, and so on.

In this tutorial, we will again build a linear regression model with Keras, this time taking data preprocessing into account. Just as in the last tutorial, we will use the Boston dataset to train and test our model. The Boston dataset is a popular dataset that relates the median price of a house to several other factors. It can be loaded from Scikit-learn’s built-in datasets, and its description is shown below.

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

A machine learning model makes predictions by generalizing from patterns in the training data. This implies that the model must first learn the patterns embedded in the data. And that’s intuitive. If I give you a sequence of numbers, say 2, 4, 6, 8, and ask you to predict the next number, you’d most likely say 10. You know this because you noticed that the sequence increases by 2 each time. You generalized the pattern you observed in the data.

The same principle applies to machine learning models. Only that this time, the data is so much larger and messier that we humans cannot easily see the embedded patterns. For instance, the Boston dataset, which is regarded as a small dataset, has 13 features plus the target, with 506 samples. It’s almost impossible to spot any substantial pattern by merely looking at the numbers.

But here’s the thing. Most times, not all features directly affect the target. Features that do not meaningfully affect the labels are called noise, and they should be scaled back or removed entirely. Selecting the most important features can greatly improve the performance of your model. Not only does it make the data more compact, it also allows the model to learn patterns much faster during training.

Another point to note is that some features are highly correlated, such that a change in one strongly affects the other. This is called multicollinearity. If you observe it in your data, it is good practice to remove one of the features or, better still, merge the features into one. While multicollinearity may not hurt your model’s predictive performance, it is good practice to check for it and deal with it so you are not carrying redundant features.

In this tutorial, you will learn the steps involved in data preprocessing and model building. By the end, you will have discovered:

  • How to get a quick overview of your data
  • How to deal with missing values
  • How to check for multicollinearity
  • How to deal with multicollinearity
  • How to inspect your data
  • How to check for outliers
  • How to normalize and standardize your data
  • How to build a neural network with Keras
  • How to train a neural network
  • How to evaluate the model
  • How to improve the model

These steps are the framework for building machine learning models. Let’s dive in. 

Data Overview

Inspecting the data is a critical step when building a machine learning model, because more often than not the data has some imperfections. Moreover, you need to be conversant with the features of the data and their data types. A good and common practice is to check the first five rows of the data.

Let’s start by importing the necessary libraries and loading the data from sklearn.datasets. Throughout this tutorial, we will also use other libraries such as matplotlib, seaborn, NumPy, and of course TensorFlow.

# import the necessary libraries
from sklearn.datasets import load_boston
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

#load the dataset
data = load_boston()
#convert the dataset into a Pandas dataframe and add the target column named 'Price'
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Price'] = data.target

We do this using the head() method of pandas, which prints the first five rows of the dataset. Needless to say, you need to have pandas installed on your machine. If you do not, simply type pip install pandas in your console.
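
A minimal snippet that produces the output shown below would look like this:

#print the first five rows of the dataframe
print(df.head())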

Output:

CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   
 
   PTRATIO       B  LSTAT  Price  
0     15.3  396.90   4.98   24.0  
1     17.8  396.90   9.14   21.6  
2     17.8  392.83   4.03   34.7  
3     18.7  394.63   2.94   33.4  
4     18.7  396.90   5.33   36.2  

Let’s see the number of rows and columns we have in our dataset. This will help give an idea of how large the dataset is. This is done using the shape attribute of the dataframe. 

#check the number of rows and columns in the dataset

df.shape

Output:

(506, 14)

To get an overview of the data, we use the describe() method. The method shows the count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum value of each column.
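
The output below is the transposed summary table (features appear as rows), so a sketch of the call might be:

#summary statistics for each column, transposed so that features appear as rows
print(df.describe().transpose())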

Output:

    count        mean         std        min         25%        50%  \
CRIM     506.0    3.613524    8.601545    0.00632    0.082045    0.25651   
ZN       506.0   11.363636   23.322453    0.00000    0.000000    0.00000   
INDUS    506.0   11.136779    6.860353    0.46000    5.190000    9.69000   
CHAS     506.0    0.069170    0.253994    0.00000    0.000000    0.00000   
NOX      506.0    0.554695    0.115878    0.38500    0.449000    0.53800   
RM       506.0    6.284634    0.702617    3.56100    5.885500    6.20850   
AGE      506.0   68.574901   28.148861    2.90000   45.025000   77.50000   
DIS      506.0    3.795043    2.105710    1.12960    2.100175    3.20745   
RAD      506.0    9.549407    8.707259    1.00000    4.000000    5.00000   
TAX      506.0  408.237154  168.537116  187.00000  279.000000  330.00000   
PTRATIO  506.0   18.455534    2.164946   12.60000   17.400000   19.05000   
B        506.0  356.674032   91.294864    0.32000  375.377500  391.44000   
LSTAT    506.0   12.653063    7.141062    1.73000    6.950000   11.36000   
Price    506.0   22.532806    9.197104    5.00000   17.025000   21.20000   
                75%       max  
CRIM       3.677083   88.9762  
ZN        12.500000  100.0000  
INDUS     18.100000   27.7400  
CHAS       0.000000    1.0000  
NOX        0.624000    0.8710  
RM         6.623500    8.7800  
AGE       94.075000  100.0000  
DIS        5.188425   12.1265  
RAD       24.000000   24.0000  
TAX      666.000000  711.0000  
PTRATIO   20.200000   22.0000  
B        396.225000  396.9000  
LSTAT     16.955000   37.9700  
Price     25.000000   50.0000 

You’d observe that while some columns contain fairly large numbers (e.g. the TAX column, with a mean of about 408), others contain small numbers (e.g. the NOX column, with a mean of about 0.55). Having a dataset with such a wide range of values makes it difficult for a machine learning model to learn. To fix this, the data should be rescaled through standardization or normalization. We will explain what these terms mean later in this tutorial.

Dealing with Missing Values

Next, we check for missing values. The presence of missing values can greatly affect how a machine learning model behaves, which makes it critically important to check whether missing values exist in your data and deal with them appropriately. To check for missing values, we use the isnull() method.

#check for null values
df.isnull().sum()

Output:

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
Price      0
dtype: int64

As seen, there are no missing values in this particular dataset, which is usually not the case for untouched real-world data. Where missing values exist, you can drop the affected rows completely if they are relatively few. If, however, the missing values are many, dropping every row that contains one is not advisable, as you’d be losing a lot of information. In such cases, you can replace missing values with an aggregate of the column, such as the mean, median, or mode.
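
Although this dataset needs neither step, here is a small sketch of both options; the names df_dropped and df_filled and the choice of the median are purely illustrative:

#option 1: drop the rows that contain missing values (sensible when they are few)
df_dropped = df.dropna()

#option 2: fill each column's missing values with that column's median
df_filled = df.fillna(df.median(numeric_only=True))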

Checking for Multicollinearity

Multicollinearity may not have a serious impact on the predictive performance of most machine learning algorithms, but it is important to check for it in order to better understand your data. So let’s first discuss what multicollinearity is. Multicollinearity occurs when two or more independent variables (features) are strongly correlated. This can be a big deal in linear regression problems because it makes the regression coefficients unreliable. By implication, you won’t have clear insight into how each feature individually affects the target variable.

Let’s say you have a linear regression problem given by the equation

Y = m1X1 + m2X2 + m3X3 + … + mnXn + c

If X1 and X2 are strongly correlated, an increase in X1 will cause an increase in X2. Thus, you won’t be able to determine how X1 and X2 individually affect the target variable, Y. 

So now we have an idea of what multicollinearity is, how do we detect it?

Before dealing with multicollinearity, you must first detect it. There are a couple of methods for doing so. One way is to plot the correlation matrix of the data as a heat map and look for pairs of features with a strong correlation (positive or negative). Another method is to calculate the VIF for each feature and flag columns with high VIF scores. In this tutorial, we will focus on the VIF method.
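
We won’t use the heat-map route here, but a quick sketch with the seaborn library we already imported could look like this (the colormap choice is arbitrary):

#plot a heatmap of the correlation matrix to spot strongly correlated features
corr = df.corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()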

VIF stands for Variance Inflation Factor. For a given feature, it is calculated as 1 / (1 − R²), where R² is obtained by regressing that feature against all the other features.

  • A VIF score of 1 means there is no correlation at all.
  • A VIF score between 1 and 5 means there is a slight correlation.
  • A VIF score greater than 10 means there is a strong correlation.

We calculate the VIF scores for each column using the statsmodels library. The code that calculates the VIF scores and creates a DataFrame is shown below. 

def create_vif(dataframe):
    '''This function calculates the Variance Inflation Factor for each column and returns the results as a dataframe'''
    
    #create an empty dataframe
    vif_table = pd.DataFrame()
    #populate the first column with the columns of the dataset
    vif_table['variables'] = dataframe.columns
    #calculate the VIF of each column and store the values in a VIF column
    vif_table['VIF'] = [vif(dataframe.values, i) for i in range(dataframe.shape[1])]
    
    return vif_table

#print the VIF table for each variable
print(create_vif(df))

Output:

  variables         VIF
0       CRIM    2.131404
1         ZN    2.910004
2      INDUS   14.485874
3       CHAS    1.176266
4        NOX   74.004269
5         RM  136.101743
6        AGE   21.398863
7        DIS   15.430455
8        RAD   15.369980
9        TAX   61.939713
10   PTRATIO   87.227233
11         B   21.351015
12     LSTAT   12.615188
13     Price   24.503206

As seen from the table, DIS, RAD, and INDUS have VIF scores of about 15.43, 15.37, and 14.49 respectively. These values are greater than 10 and are close together, which suggests that these three columns are strongly correlated. So how do we deal with them?

Dealing with Multicollinearity

One option is to drop all but one of the strongly correlated columns. The idea is that the remaining column carries much of the same information as the ones dropped and can stand in their stead. Other data scientists prefer to combine all the correlated columns into one.
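
For completeness, a sketch of the dropping approach might look like the line below; which column to keep (here DIS) is a judgment call, and df_reduced is just an illustrative name since we won’t take this route:

#alternative approach (not used in this tutorial): keep DIS and drop the other two correlated columns
df_reduced = df.drop(['RAD', 'INDUS'], axis=1)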

Here, we will combine the correlated columns into one so that the behavior of each individual column is still captured. We do this using the Principal Component Analysis (PCA) transformation technique, which reduces the dimensionality of the data without losing its most important properties. To apply the PCA transformation, we instantiate the class and then call fit_transform on the correlated columns. The code below shows this procedure.

#compress the columns 'DIS', 'RAD', 'INDUS' into 1 column
pca = PCA(n_components=1)
#call the compressed column 'new'
df['new'] = pca.fit_transform(df[['DIS', 'RAD', 'INDUS']])
#drop the three columns from the dataset
df = df.drop(['DIS', 'RAD', 'INDUS'], axis=1)

With the new dataframe, we can recheck the VIF scores using the function we created earlier. You’d notice that the combined column, new, has a VIF below 10, which is good.

#recheck the new VIF table
print(create_vif(df))

Output:

  variables         VIF
0       CRIM    2.006392
1         ZN    2.349186
2       CHAS    1.173519
3        NOX   65.166302
4         RM  133.757986
5        AGE   18.823276
6        TAX   56.391909
7    PTRATIO   77.938234
8          B   21.345554
9      LSTAT   12.580803
10     Price   23.131681
11       new    9.194328

Inspecting the Data

You should also inspect your data by plotting features against one another. The seaborn library provides an easy way to do this with the pairplot function. We select three features with high VIF (NOX, RM, TAX) and two features with lower VIF (LSTAT, new).

#print a pairplot to check the relationships between strongly correlated features
pp = sns.pairplot(df[['NOX', 'RM', 'TAX', 'LSTAT', 'new']])
pp = pp.map_lower(sns.regplot)
pp = pp.map_upper(sns.kdeplot);

Output:

[Figure: pairplot of NOX, RM, TAX, LSTAT, and new]

We can see the relationships between the features from the pairplot. For some feature pairs, the data points follow a clear pattern, which suggests that a linear regression model can learn the data and subsequently make predictions. You’d also notice that some data points lie far from where the majority are. Next, we will discuss how to make our model robust to such data points, called outliers.

Checking for Outliers

Statistical parameters such as the mean and standard deviation, as well as techniques such as linear regression and ANOVA, are sensitive to outliers. Ideally, the distribution of values in a column should follow a normal (bell-shaped) curve, with the majority of the values near the center. A dataset with outliers, however, has exceptional values at the extreme ends of the distribution curve. These unusual observations are called outliers.

Outliers strongly affect the training of machine learning models, often causing longer training times and reduced model accuracy.

There are various ways of detecting outliers in a dataset. For this tutorial, we will plot boxplots to visualize how the data points are distributed. Any point that falls above or below the whiskers of a boxplot is an outlier, and such points are typically drawn as circles beyond the whiskers.

We would use the seaborn library to plot a boxplot for the independent variables of the dataset. 

df1 = df.copy()
# Create a figure with 10 subplots and a width spacing of 1.5
fig, ax = plt.subplots(2,5)
fig.subplots_adjust(wspace=1.5)

# Create a boxplot (on a log scale) for each of the independent features
box_plot1 = sns.boxplot(y=np.log(df1[df1.columns[0]]), ax=ax[0][0])
box_plot2 = sns.boxplot(y=np.log(df1[df1.columns[1]]), ax=ax[0][1])
box_plot3 = sns.boxplot(y=np.log(df1[df1.columns[2]]), ax=ax[0][2])
box_plot4 = sns.boxplot(y=np.log(df1[df1.columns[3]]), ax=ax[0][3])
box_plot5 = sns.boxplot(y=np.log(df1[df1.columns[4]]), ax=ax[0][4])
box_plot6 = sns.boxplot(y=np.log(df1[df1.columns[5]]), ax=ax[1][0])
box_plot7 = sns.boxplot(y=np.log(df1[df1.columns[6]]), ax=ax[1][1])
box_plot8 = sns.boxplot(y=np.log(df1[df1.columns[7]]), ax=ax[1][2])
box_plot9 = sns.boxplot(y=np.log(df1[df1.columns[8]]), ax=ax[1][3])
box_plot10 = sns.boxplot(y=np.log(df1[df1.columns[9]]), ax=ax[1][4])

Output:

[Figure: boxplots of the independent features]

From the boxplots, you’d observe that features such as RM, AGE, PTRATIO, B, and LSTAT have outliers. So how do we deal with them? Dropping the rows that contain outliers is usually not the best idea; especially when the outliers are many, we’d be losing a lot of information. Instead, you can rescale your data so that it is robust to outliers.

Data Normalization and Standardization 

We can rescale the data distribution through normalization or standardization. Normalization involves rescaling your data so that the values fall within a predetermined range (typically 0 to 1). Standardization, on the other hand, involves rescaling your data so that it has a mean of zero and a standard deviation of one, which reshapes the frequency distribution to something closer to a standard bell curve.

Scikit-learn’s preprocessing module allows us to carry out the various standardization and normalization steps. Let’s discuss some of the options.

1. StandardScaler: This rescales the data by subtracting the mean from each entry and dividing by the standard deviation. After StandardScaler has been applied, the distribution has a mean of zero and a standard deviation of one, so for roughly normal data about 68% of the values fall between -1 and 1.

2. MinMaxScaler: The MinMaxScaler subtracts the minimum value of the feature from each entry and divides by the range of the feature. It does not change the shape of the distribution but squeezes the values into the range 0 to 1.

3. RobustScaler: The RobustScaler subtracts the median from each entry and divides by the interquartile range of the feature. Since the median and interquartile range are not influenced by extreme values, the rescaled distribution is much less affected by outliers. This makes RobustScaler a good choice for data that contains outliers.
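
To see how the three scalers differ in code, here is a quick sketch applied to a single column (StandardScaler needs an extra import, the TAX column is picked arbitrarily, and only RobustScaler is used in the rest of the tutorial):

from sklearn.preprocessing import StandardScaler

#each scaler follows the same fit/transform pattern; only the statistics used differ
standardized = StandardScaler().fit_transform(df[['TAX']])  #zero mean, unit standard deviation
normalized = MinMaxScaler().fit_transform(df[['TAX']])      #values squeezed into the range 0 to 1
robust = RobustScaler().fit_transform(df[['TAX']])          #centered on the median, scaled by the IQR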

Since our data contains outliers, we will rescale it using the RobustScaler class. Note that we need to split the data into train and test sets first. We also need to one-hot encode the CHAS column (a categorical feature). We then fit the RobustScaler on the train set and transform both the train and test sets. The code below does all of this.

#One-Hot Encode the CHAS column
df = pd.get_dummies(df, columns=['CHAS'], drop_first=True)
#define the features and the labels, X and y
X = df.drop(['Price'], axis=1)
y = df['Price']

#split the features and labels into  train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

#rescale the data to be robust to outliers
scaler = RobustScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Now that we have preprocessed the data, it is time to build the neural network model using Keras.

Building a Multilayer Neural Network with TensorFlow Keras

Before training our model, we have to build it. In Keras, the architecture of a neural network is built with the Sequential class, and you can add as many layers as you like.

First off, we will create a single hidden layer and see how the model performs. 

Since the data we are passing into the model has 11 features, we must set the input_dim parameter of the first layer to 11. Our single hidden layer has 15 nodes, and it feeds into the output layer, which has just one node. Since this is a linear regression problem and the output is a single number, the final layer should have one node.

In addition, the hidden layer uses a ReLU activation function, whereas the output layer uses a linear activation function. If you don’t know what activation functions are, I like to think of them as ‘switches’ that transform the weighted sum of a node’s inputs into the output passed on to the next layer.

The code to build the neural network architecture is shown below. 

#build the neural network architecture
model = Sequential()
model.add(Dense(15, input_dim=11, activation='relu'))
model.add(Dense(1, activation='linear'))

The next step is to compile the model. We use the Adam optimizer with a mean squared error loss, and we define the metrics as mean squared error and mean absolute error.

model.compile(loss='mse', optimizer='adam', metrics=['mse', 'mae'])

Training the Model

The model is trained for 200 epochs with a validation set made up of 20% of the training data. The validation set helps you check, via the loss function, how well the model is learning during training.

#train the neural network on the train dataset
history = model.fit(X_train, y_train, epochs=200, validation_split=0.2)
Output:
Train on 323 samples, validate on 81 samples
Epoch 1/200
323/323 [==============================] - 0s 1ms/sample - loss: 616.0037 - mean_squared_error: 616.0037 - mean_absolute_error: 23.3245 - val_loss: 584.0988 - val_mean_squared_error: 584.0989 - val_mean_absolute_error: 22.4651
Epoch 2/200
323/323 [==============================] - 0s 127us/sample - loss: 606.8097 - mean_squared_error: 606.8097 - mean_absolute_error: 23.1052 - val_loss: 576.2635 - val_mean_squared_error: 576.2634 - val_mean_absolute_error: 22.2775
Epoch 3/200
323/323 [==============================] - 0s 161us/sample - loss: 598.1349 - mean_squared_error: 598.1349 - mean_absolute_error: 22.8789 - val_loss: 568.6242 - val_mean_squared_error: 568.6242 - val_mean_absolute_error: 22.0914
Epoch 4/200
323/323 [==============================] - 0s 248us/sample - loss: 590.0231 - mean_squared_error: 590.0231 - mean_absolute_error: 22.6751 - val_loss: 561.2776 - val_mean_squared_error: 561.2776 - val_mean_absolute_error: 21.9079
Epoch 5/200
323/323 [==============================] - 0s 161us/sample - loss: 582.1993 - mean_squared_error: 582.1993 - mean_absolute_error: 22.4697 - val_loss: 554.2171 - val_mean_squared_error: 554.2170 - val_mean_absolute_error: 21.7276
Epoch 6/200
323/323 [==============================] - 0s 198us/sample - loss: 574.5526 - mean_squared_error: 574.5526 - mean_absolute_error: 22.2655 - val_loss: 547.2002 - val_mean_squared_error: 547.2002 - val_mean_absolute_error: 21.5468
Epoch 7/200
323/323 [==============================] - 0s 248us/sample - loss: 566.7739 - mean_squared_error: 566.7739 - mean_absolute_error: 22.0529 - val_loss: 540.1250 - val_mean_squared_error: 540.1251 - val_mean_absolute_error: 21.3606
Epoch 8/200
323/323 [==============================] - 0s 111us/sample - loss: 559.2289 - mean_squared_error: 559.2289 - mean_absolute_error: 21.8367 - val_loss: 532.9769 - val_mean_squared_error: 532.9769 - val_mean_absolute_error: 21.1680
Epoch 9/200
323/323 [==============================] - 0s 111us/sample - loss: 551.4707 - mean_squared_error: 551.4707 - mean_absolute_error: 21.6204 - val_loss: 526.0247 - val_mean_squared_error: 526.0247 - val_mean_absolute_error: 20.9819
Epoch 10/200
323/323 [==============================] - 0s 149us/sample - loss: 543.9210 - mean_squared_error: 543.9210 - mean_absolute_error: 21.4173 - val_loss: 519.0010 - val_mean_squared_error: 519.0010 - val_mean_absolute_error: 20.7915
Epoch 11/200
323/323 [==============================] - 0s 124us/sample - loss: 536.3257 - mean_squared_error: 536.3257 - mean_absolute_error: 21.2125 - val_loss: 511.7967 - val_mean_squared_error: 511.7967 - val_mean_absolute_error: 20.5944
Epoch 12/200
323/323 [==============================] - 0s 136us/sample - loss: 528.6936 - mean_squared_error: 528.6937 - mean_absolute_error: 21.0106 - val_loss: 504.5885 - val_mean_squared_error: 504.5885 - val_mean_absolute_error: 20.3977
Epoch 13/200
323/323 [==============================] - 0s 124us/sample - loss: 520.8847 - mean_squared_error: 520.8847 - mean_absolute_error: 20.7995 - val_loss: 497.2613 - val_mean_squared_error: 497.2613 - val_mean_absolute_error: 20.2193
Epoch 14/200
323/323 [==============================] - 0s 124us/sample - loss: 513.0849 - mean_squared_error: 513.0849 - mean_absolute_error: 20.5858 - val_loss: 489.8176 - val_mean_squared_error: 489.8176 - val_mean_absolute_error: 20.0351
Epoch 15/200
323/323 [==============================] - 0s 124us/sample - loss: 505.3566 - mean_squared_error: 505.3567 - mean_absolute_error: 20.3856 - val_loss: 482.2511 - val_mean_squared_error: 482.2511 - val_mean_absolute_error: 19.8488
Epoch 16/200
323/323 [==============================] - 0s 99us/sample - loss: 497.5187 - mean_squared_error: 497.5188 - mean_absolute_error: 20.1893 - val_loss: 474.6838 - val_mean_squared_error: 474.6838 - val_mean_absolute_error: 19.6661
Epoch 17/200
323/323 [==============================] - 0s 149us/sample - loss: 489.7085 - mean_squared_error: 489.7086 - mean_absolute_error: 19.9929 - val_loss: 467.2122 - val_mean_squared_error: 467.2122 - val_mean_absolute_error: 19.4878
Epoch 18/200
323/323 [==============================] - 0s 223us/sample - loss: 482.0081 - mean_squared_error: 482.0081 - mean_absolute_error: 19.8129 - val_loss: 459.4699 - val_mean_squared_error: 459.4698 - val_mean_absolute_error: 19.3026
Epoch 19/200
323/323 [==============================] - 0s 124us/sample - loss: 474.0288 - mean_squared_error: 474.0287 - mean_absolute_error: 19.6281 - val_loss: 451.8731 - val_mean_squared_error: 451.8731 - val_mean_absolute_error: 19.1187
Epoch 20/200
323/323 [==============================] - 0s 111us/sample - loss: 466.1271 - mean_squared_error: 466.1271 - mean_absolute_error: 19.4428 - val_loss: 444.5884 - val_mean_squared_error: 444.5884 - val_mean_absolute_error: 18.9436
…
Epoch 181/200
323/323 [==============================] - 0s 149us/sample - loss: 28.0329 - mean_squared_error: 28.0329 - mean_absolute_error: 3.8922 - val_loss: 29.0025 - val_mean_squared_error: 29.0025 - val_mean_absolute_error: 3.8905
Epoch 182/200
323/323 [==============================] - 0s 136us/sample - loss: 27.7569 - mean_squared_error: 27.7569 - mean_absolute_error: 3.8608 - val_loss: 28.9420 - val_mean_squared_error: 28.9420 - val_mean_absolute_error: 3.8719
Epoch 183/200
323/323 [==============================] - 0s 124us/sample - loss: 27.5550 - mean_squared_error: 27.5550 - mean_absolute_error: 3.8354 - val_loss: 28.9521 - val_mean_squared_error: 28.9521 - val_mean_absolute_error: 3.8516
Epoch 184/200
323/323 [==============================] - 0s 149us/sample - loss: 27.3054 - mean_squared_error: 27.3054 - mean_absolute_error: 3.8107 - val_loss: 28.5168 - val_mean_squared_error: 28.5168 - val_mean_absolute_error: 3.8161
Epoch 185/200
323/323 [==============================] - 0s 173us/sample - loss: 27.0219 - mean_squared_error: 27.0219 - mean_absolute_error: 3.7885 - val_loss: 28.0858 - val_mean_squared_error: 28.0858 - val_mean_absolute_error: 3.7814
Epoch 186/200
323/323 [==============================] - 0s 161us/sample - loss: 26.7649 - mean_squared_error: 26.7649 - mean_absolute_error: 3.7670 - val_loss: 27.8294 - val_mean_squared_error: 27.8294 - val_mean_absolute_error: 3.7574
Epoch 187/200
323/323 [==============================] - 0s 136us/sample - loss: 26.5128 - mean_squared_error: 26.5128 - mean_absolute_error: 3.7427 - val_loss: 27.4006 - val_mean_squared_error: 27.4006 - val_mean_absolute_error: 3.7293
Epoch 188/200
323/323 [==============================] - 0s 161us/sample - loss: 26.3242 - mean_squared_error: 26.3242 - mean_absolute_error: 3.7329 - val_loss: 27.1109 - val_mean_squared_error: 27.1109 - val_mean_absolute_error: 3.7049
Epoch 189/200
323/323 [==============================] - 0s 136us/sample - loss: 26.0745 - mean_squared_error: 26.0745 - mean_absolute_error: 3.7042 - val_loss: 27.0394 - val_mean_squared_error: 27.0394 - val_mean_absolute_error: 3.6909
Epoch 190/200
323/323 [==============================] - 0s 161us/sample - loss: 25.8574 - mean_squared_error: 25.8574 - mean_absolute_error: 3.6782 - val_loss: 26.9795 - val_mean_squared_error: 26.9795 - val_mean_absolute_error: 3.6774
Epoch 191/200
323/323 [==============================] - 0s 149us/sample - loss: 25.6682 - mean_squared_error: 25.6682 - mean_absolute_error: 3.6587 - val_loss: 26.8557 - val_mean_squared_error: 26.8557 - val_mean_absolute_error: 3.6599
Epoch 192/200
323/323 [==============================] - 0s 149us/sample - loss: 25.4568 - mean_squared_error: 25.4568 - mean_absolute_error: 3.6391 - val_loss: 26.5597 - val_mean_squared_error: 26.5597 - val_mean_absolute_error: 3.6302
Epoch 193/200
323/323 [==============================] - 0s 111us/sample - loss: 25.2383 - mean_squared_error: 25.2383 - mean_absolute_error: 3.6239 - val_loss: 26.2430 - val_mean_squared_error: 26.2430 - val_mean_absolute_error: 3.6019
Epoch 194/200
323/323 [==============================] - 0s 124us/sample - loss: 25.0200 - mean_squared_error: 25.0200 - mean_absolute_error: 3.6001 - val_loss: 26.2021 - val_mean_squared_error: 26.2021 - val_mean_absolute_error: 3.5890
Epoch 195/200
323/323 [==============================] - 0s 124us/sample - loss: 24.8465 - mean_squared_error: 24.8465 - mean_absolute_error: 3.5796 - val_loss: 25.9885 - val_mean_squared_error: 25.9885 - val_mean_absolute_error: 3.5653
Epoch 196/200
323/323 [==============================] - 0s 111us/sample - loss: 24.6697 - mean_squared_error: 24.6697 - mean_absolute_error: 3.5667 - val_loss: 25.7908 - val_mean_squared_error: 25.7908 - val_mean_absolute_error: 3.5423
Epoch 197/200
323/323 [==============================] - 0s 99us/sample - loss: 24.4858 - mean_squared_error: 24.4858 - mean_absolute_error: 3.5508 - val_loss: 25.7717 - val_mean_squared_error: 25.7717 - val_mean_absolute_error: 3.5298
Epoch 198/200
323/323 [==============================] - 0s 136us/sample - loss: 24.2800 - mean_squared_error: 24.2800 - mean_absolute_error: 3.5314 - val_loss: 25.8030 - val_mean_squared_error: 25.8030 - val_mean_absolute_error: 3.5115
Epoch 199/200
323/323 [==============================] - 0s 99us/sample - loss: 24.2206 - mean_squared_error: 24.2206 - mean_absolute_error: 3.5227 - val_loss: 25.5244 - val_mean_squared_error: 25.5244 - val_mean_absolute_error: 3.4847
Epoch 200/200
323/323 [==============================] - 0s 111us/sample - loss: 23.9753 - mean_squared_error: 23.9753 - mean_absolute_error: 3.5040 - val_loss: 25.1087 - val_mean_squared_error: 25.1087 - val_mean_absolute_error: 3.4590

We have successfully trained our model. You’d notice that the loss moved from 616.0 in the first epoch to about 24.0 in the 200th epoch. This shows that the model was improving with every epoch.

To visualize the losses, we convert the history object into a dataframe and plot the training loss and the validation loss. If the gap between the two is large, it means the model is overfitting the training data rather than learning patterns that generalize.

#plot the loss and validation loss of the dataset
history_df = pd.DataFrame(history.history)
plt.plot(history_df['loss'], label='loss')
plt.plot(history_df['val_loss'], label='val_loss')

plt.legend()

Output:

[Figure: training loss and validation loss per epoch]

In our case, the model has learned the data well since the training loss and validation loss are close. Furthermore, the loss greatly dropped in the first few epochs and stabilized at some point without going up. This is an indication that while the model has learned, it is not overfitting. 

Notice that at the end of training, the loss values are as follows.

Training loss: 23.97

Mean absolute error: 3.50

Validation loss: 25.11

Validation mean absolute error: 3.46

Evaluating the Model

We can evaluate the model with the evaluate() method. This compares the model’s predictions on the test features with the actual test labels and calculates the loss/error.

#evaluate the model
model.evaluate(X_test, y_test, batch_size=128)

Output:

102/102 [==============================] - 0s 153us/sample - loss: 22.5740 - mean_squared_error: 22.5740 - mean_absolute_error: 3.5839
[22.573974609375, 22.573975, 3.5838845]

We can compare what the model predicts with the actual values using a simple plot.

y_pred = model.predict(X_test).flatten()

a = plt.axes(aspect='equal')
plt.scatter(y_test, y_pred)
plt.xlabel('True values')
plt.ylabel('Predicted values')
plt.title('A plot that shows the true and predicted values')
plt.xlim([0, 60])
plt.ylim([0, 60])
plt.plot([0, 60], [0, 60])

Output:

[Figure: true values vs. predicted values]

From the plot, you’d see that the model has performed reasonably well in making correct predictions. 

We could still tweak our model to further enhance its performance. There are many techniques for improving a neural network, including adding more hidden layers, increasing the number of nodes in a layer, changing the activation function, adding more data, tweaking optimizer parameters, and so on.
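
As a small illustration of the last option, tweaking an optimizer parameter might look like the sketch below; the learning rate of 0.01 is an arbitrary example, and a recent TensorFlow version is assumed:

#example of tweaking an optimizer parameter: compile with a custom learning rate
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
model.compile(loss='mse', optimizer=optimizer, metrics=['mse', 'mae'])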

Let’s see how adding more hidden layers will improve the model. 

Improving the Model by Adding More Hidden Layers

One way of improving a neural network’s performance is to add more hidden layers. Deeper networks can often capture more complex patterns, though depth alone does not guarantee better results.

So let’s go ahead and change our model by adding two more hidden layers, one with 7 nodes and the other with 3 nodes, both with the ReLU activation function. Just like the last model, we compile it with a mean squared error loss, the Adam optimizer, and both mean squared error and mean absolute error as metrics.

#build the neural network architecture
model = Sequential()
model.add(Dense(15, input_dim=11, activation='relu'))
model.add(Dense(7, activation='relu'))
model.add(Dense(3, activation='relu'))
model.add(Dense(1, activation='linear'))

model.compile(loss='mse', optimizer='adam', metrics=['mse', 'mae'])

#train the neural network on the train dataset
history = model.fit(X_train, y_train, epochs=200, validation_split=0.2)

Output:

Train on 323 samples, validate on 81 samples
Epoch 1/200
323/323 [==============================] - 1s 2ms/sample - loss: 584.4734 - mean_squared_error: 584.4734 - mean_absolute_error: 22.6072 - val_loss: 553.0111 - val_mean_squared_error: 553.0111 - val_mean_absolute_error: 21.6163
Epoch 2/200
323/323 [==============================] - 0s 97us/sample - loss: 575.5218 - mean_squared_error: 575.5219 - mean_absolute_error: 22.3731 - val_loss: 544.7089 - val_mean_squared_error: 544.7089 - val_mean_absolute_error: 21.4320
Epoch 3/200
323/323 [==============================] - 0s 97us/sample - loss: 565.2366 - mean_squared_error: 565.2367 - mean_absolute_error: 22.1050 - val_loss: 535.9432 - val_mean_squared_error: 535.9432 - val_mean_absolute_error: 21.2384
Epoch 4/200
323/323 [==============================] - 0s 145us/sample - loss: 554.2672 - mean_squared_error: 554.2672 - mean_absolute_error: 21.8140 - val_loss: 525.9688 - val_mean_squared_error: 525.9689 - val_mean_absolute_error: 21.0172
Epoch 5/200
323/323 [==============================] - 0s 145us/sample - loss: 541.8079 - mean_squared_error: 541.8079 - mean_absolute_error: 21.4882 - val_loss: 514.2750 - val_mean_squared_error: 514.2750 - val_mean_absolute_error: 20.7664
Epoch 6/200
323/323 [==============================] - 0s 145us/sample - loss: 527.6235 - mean_squared_error: 527.6235 - mean_absolute_error: 21.1237 - val_loss: 500.1802 - val_mean_squared_error: 500.1802 - val_mean_absolute_error: 20.4756
Epoch 7/200
323/323 [==============================] - 0s 145us/sample - loss: 510.5902 - mean_squared_error: 510.5903 - mean_absolute_error: 20.7072 - val_loss: 483.6809 - val_mean_squared_error: 483.6808 - val_mean_absolute_error: 20.1316
Epoch 8/200
323/323 [==============================] - 0s 145us/sample - loss: 490.7871 - mean_squared_error: 490.7871 - mean_absolute_error: 20.2235 - val_loss: 463.2415 - val_mean_squared_error: 463.2415 - val_mean_absolute_error: 19.7122
Epoch 9/200
323/323 [==============================] - 0s 145us/sample - loss: 465.3827 - mean_squared_error: 465.3828 - mean_absolute_error: 19.6485 - val_loss: 439.1797 - val_mean_squared_error: 439.1796 - val_mean_absolute_error: 19.2099
Epoch 10/200
323/323 [==============================] - 0s 145us/sample - loss: 436.7313 - mean_squared_error: 436.7312 - mean_absolute_error: 18.9943 - val_loss: 410.8449 - val_mean_squared_error: 410.8448 - val_mean_absolute_error: 18.5876
Epoch 11/200
323/323 [==============================] - 0s 145us/sample - loss: 404.6039 - mean_squared_error: 404.6039 - mean_absolute_error: 18.2399 - val_loss: 379.6046 - val_mean_squared_error: 379.6046 - val_mean_absolute_error: 17.8701
Epoch 12/200
323/323 [==============================] - 0s 145us/sample - loss: 369.8315 - mean_squared_error: 369.8315 - mean_absolute_error: 17.4045 - val_loss: 346.7320 - val_mean_squared_error: 346.7320 - val_mean_absolute_error: 17.0592
Epoch 13/200
323/323 [==============================] - 0s 145us/sample - loss: 332.8788 - mean_squared_error: 332.8788 - mean_absolute_error: 16.4958 - val_loss: 314.1923 - val_mean_squared_error: 314.1923 - val_mean_absolute_error: 16.2052
Epoch 14/200
323/323 [==============================] - 0s 145us/sample - loss: 298.7931 - mean_squared_error: 298.7931 - mean_absolute_error: 15.5864 - val_loss: 281.9098 - val_mean_squared_error: 281.9098 - val_mean_absolute_error: 15.3273
Epoch 15/200
323/323 [==============================] - 0s 145us/sample - loss: 265.7078 - mean_squared_error: 265.7079 - mean_absolute_error: 14.5916 - val_loss: 253.8650 - val_mean_squared_error: 253.8650 - val_mean_absolute_error: 14.4485
Epoch 16/200
323/323 [==============================] - 0s 97us/sample - loss: 237.9645 - mean_squared_error: 237.9644 - mean_absolute_error: 13.6058 - val_loss: 230.3261 - val_mean_squared_error: 230.3261 - val_mean_absolute_error: 13.6310
Epoch 17/200
323/323 [==============================] - 0s 97us/sample - loss: 213.5237 - mean_squared_error: 213.5237 - mean_absolute_error: 12.7039 - val_loss: 210.8874 - val_mean_squared_error: 210.8874 - val_mean_absolute_error: 13.0260
Epoch 18/200
323/323 [==============================] - 0s 97us/sample - loss: 193.0863 - mean_squared_error: 193.0863 - mean_absolute_error: 11.8859 - val_loss: 194.1782 - val_mean_squared_error: 194.1782 - val_mean_absolute_error: 12.4450
Epoch 19/200
323/323 [==============================] - 0s 97us/sample - loss: 176.8083 - mean_squared_error: 176.8083 - mean_absolute_error: 11.3360 - val_loss: 180.8584 - val_mean_squared_error: 180.8584 - val_mean_absolute_error: 11.8897
Epoch 20/200
323/323 [==============================] - 0s 145us/sample - loss: 164.0756 - mean_squared_error: 164.0756 - mean_absolute_error: 10.8522 - val_loss: 171.2320 - val_mean_squared_error: 171.2320 - val_mean_absolute_error: 11.4639
…
Epoch 181/200
323/323 [==============================] - 0s 97us/sample - loss: 12.1372 - mean_squared_error: 12.1372 - mean_absolute_error: 2.3793 - val_loss: 15.9544 - val_mean_squared_error: 15.9544 - val_mean_absolute_error: 2.3558
Epoch 182/200
323/323 [==============================] - 0s 145us/sample - loss: 12.0800 - mean_squared_error: 12.0800 - mean_absolute_error: 2.3553 - val_loss: 15.8774 - val_mean_squared_error: 15.8774 - val_mean_absolute_error: 2.3423
Epoch 183/200
323/323 [==============================] - 0s 97us/sample - loss: 12.0202 - mean_squared_error: 12.0202 - mean_absolute_error: 2.3414 - val_loss: 15.7801 - val_mean_squared_error: 15.7801 - val_mean_absolute_error: 2.3369
Epoch 184/200
323/323 [==============================] - 0s 145us/sample - loss: 11.9876 - mean_squared_error: 11.9876 - mean_absolute_error: 2.3502 - val_loss: 15.7188 - val_mean_squared_error: 15.7188 - val_mean_absolute_error: 2.3659
Epoch 185/200
323/323 [==============================] - 0s 242us/sample - loss: 11.9647 - mean_squared_error: 11.9647 - mean_absolute_error: 2.3655 - val_loss: 15.8191 - val_mean_squared_error: 15.8191 - val_mean_absolute_error: 2.4131
Epoch 186/200
323/323 [==============================] - 0s 145us/sample - loss: 12.0691 - mean_squared_error: 12.0691 - mean_absolute_error: 2.4635 - val_loss: 16.2266 - val_mean_squared_error: 16.2266 - val_mean_absolute_error: 2.6174
Epoch 187/200
323/323 [==============================] - 0s 145us/sample - loss: 12.1569 - mean_squared_error: 12.1569 - mean_absolute_error: 2.4610 - val_loss: 15.6773 - val_mean_squared_error: 15.6773 - val_mean_absolute_error: 2.4570
Epoch 188/200
323/323 [==============================] - 0s 145us/sample - loss: 11.8792 - mean_squared_error: 11.8792 - mean_absolute_error: 2.3678 - val_loss: 15.7074 - val_mean_squared_error: 15.7074 - val_mean_absolute_error: 2.3972
Epoch 189/200
323/323 [==============================] - 0s 145us/sample - loss: 11.9190 - mean_squared_error: 11.9190 - mean_absolute_error: 2.3617 - val_loss: 15.8393 - val_mean_squared_error: 15.8393 - val_mean_absolute_error: 2.3808
Epoch 190/200
323/323 [==============================] - 0s 145us/sample - loss: 12.8232 - mean_squared_error: 12.8232 - mean_absolute_error: 2.5695 - val_loss: 16.4048 - val_mean_squared_error: 16.4048 - val_mean_absolute_error: 2.6977
Epoch 191/200
323/323 [==============================] - 0s 145us/sample - loss: 12.0817 - mean_squared_error: 12.0817 - mean_absolute_error: 2.4824 - val_loss: 15.5024 - val_mean_squared_error: 15.5024 - val_mean_absolute_error: 2.4516
Epoch 192/200
323/323 [==============================] - 0s 145us/sample - loss: 11.8084 - mean_squared_error: 11.8084 - mean_absolute_error: 2.3831 - val_loss: 15.4221 - val_mean_squared_error: 15.4221 - val_mean_absolute_error: 2.4194
Epoch 193/200
323/323 [==============================] - 0s 97us/sample - loss: 11.7507 - mean_squared_error: 11.7507 - mean_absolute_error: 2.3955 - val_loss: 15.4557 - val_mean_squared_error: 15.4557 - val_mean_absolute_error: 2.4357
Epoch 194/200
323/323 [==============================] - 0s 145us/sample - loss: 11.6437 - mean_squared_error: 11.6437 - mean_absolute_error: 2.3657 - val_loss: 15.3709 - val_mean_squared_error: 15.3709 - val_mean_absolute_error: 2.3435
Epoch 195/200
323/323 [==============================] - 0s 145us/sample - loss: 11.6290 - mean_squared_error: 11.6290 - mean_absolute_error: 2.3445 - val_loss: 15.3940 - val_mean_squared_error: 15.3940 - val_mean_absolute_error: 2.3470
Epoch 196/200
323/323 [==============================] - 0s 97us/sample - loss: 11.6334 - mean_squared_error: 11.6334 - mean_absolute_error: 2.3860 - val_loss: 15.4824 - val_mean_squared_error: 15.4824 - val_mean_absolute_error: 2.3938
Epoch 197/200
323/323 [==============================] - 0s 145us/sample - loss: 11.6110 - mean_squared_error: 11.6110 - mean_absolute_error: 2.3495 - val_loss: 15.5030 - val_mean_squared_error: 15.5030 - val_mean_absolute_error: 2.2746
Epoch 198/200
323/323 [==============================] - 0s 145us/sample - loss: 11.8521 - mean_squared_error: 11.8521 - mean_absolute_error: 2.3540 - val_loss: 15.2363 - val_mean_squared_error: 15.2363 - val_mean_absolute_error: 2.3209
Epoch 199/200
323/323 [==============================] - 0s 145us/sample - loss: 11.5532 - mean_squared_error: 11.5532 - mean_absolute_error: 2.3486 - val_loss: 15.3506 - val_mean_squared_error: 15.3506 - val_mean_absolute_error: 2.3752
Epoch 200/200
323/323 [==============================] - 0s 97us/sample - loss: 11.4892 - mean_squared_error: 11.4892 - mean_absolute_error: 2.3523 - val_loss: 15.3902 - val_mean_squared_error: 15.3902 - val_mean_absolute_error: 2.3758

Let’s compare the losses with those of the first model.

Training loss: 11.49

Mean absolute error: 2.35

Validation loss: 15.39

Validation mean absolute error: 2.38

Notice that the loss is now 11.49 even with the same number of epochs. We can again plot the graph to show both the training loss and validation loss. 

#plot the loss and validation loss of the dataset
history_df = pd.DataFrame(history.history)
plt.plot(history_df['loss'], label='loss')
plt.plot(history_df['val_loss'], label='val_loss')

plt.legend()

Output:

[Figure: training loss and validation loss per epoch for the deeper model]

As seen in the figure above, the model’s loss stabilizes after the first 50 epochs. This is an improvement, as the previous model took roughly 175 epochs to approach its minimum.

We can also evaluate this model to see how well it performs on the test set.

#evaluate the model
model.evaluate(X_test, y_test, batch_size=128)

Output:

102/102 [==============================] - 0s 0s/sample - loss: 13.0725 - mean_squared_error: 13.0725 - mean_absolute_error: 2.7085

Finally, we will visualize the prediction with a simple plot. 

y_pred = model.predict(X_test).flatten()

a = plt.axes(aspect='equal')
plt.scatter(y_test, y_pred)
plt.xlabel('True values')
plt.ylabel('Predicted values')
plt.title('A plot that shows the true and predicted values')
plt.xlim([0, 60])
plt.ylim([0, 60])
plt.plot([0, 60], [0, 60])

Output:

[Figure: true values vs. predicted values for the deeper model]

Notice that this time, the data points are clustered more tightly around the straight line, which indicates that our model is performing better.
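
If you want a single number to back up the visual impression, a quick sketch of the test-set root mean squared error using NumPy (already imported) could be:

#root mean squared error on the test set, as a single summary number
rmse = np.sqrt(np.mean((np.asarray(y_test) - y_pred) ** 2))
print('Test RMSE:', round(rmse, 2))

This value should come out close to the square root of the mean squared error reported by evaluate().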

Conclusion

In this tutorial, you have learned the step-by-step approach to data preprocessing and building a linear regression model. We saw how to build a neural network using Keras in TensorFlow and went a step further to improve the model by increasing the number of hidden layers. 
