In the last tutorial, we introduced the concept of linear regression with Keras and how to build a linear regression model using TensorFlow’s estimator API. In that tutorial, we skipped a step that is vital for real-life problems. Building any machine learning model requires you to preprocess the data before feeding it to the machine learning algorithm or neural network architecture. This is because the data may contain missing values, duplicate values, unreasonable entries, or redundant features, and these anomalies can greatly affect the performance of your model. Data preprocessing involves data cleaning, data augmentation, exploratory data analysis, data standardization, normalization, feature extraction, and more.
In this tutorial, we will again build a linear regression model with Keras, this time taking data preprocessing into account. Just as in the last tutorial, we will use the Boston dataset to train and test our model. The Boston dataset is a popular dataset that relates the median price of a house to other contributing factors. It can be loaded from Scikit-learn’s built-in datasets, and its description is shown below.
Boston house prices dataset
---------------------------
**Data Set Characteristics:**

:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
    - CRIM     per capita crime rate by town
    - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
    - INDUS    proportion of non-retail business acres per town
    - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    - NOX      nitric oxides concentration (parts per 10 million)
    - RM       average number of rooms per dwelling
    - AGE      proportion of owner-occupied units built prior to 1940
    - DIS      weighted distances to five Boston employment centres
    - RAD      index of accessibility to radial highways
    - TAX      full-value property-tax rate per $10,000
    - PTRATIO  pupil-teacher ratio by town
    - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    - LSTAT    % lower status of the population
    - MEDV     Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
A machine learning model makes predictions by generalizing from patterns in the training data. This implies that the model must first learn the patterns embedded in the data. And that’s intuitive. If I give you a sequence of numbers, say 2, 4, 6, 8, and ask you to predict the next number, you’d most likely say 10. You know this because you noticed that the sequence increases by 2. You generalized the observed pattern in the data.
The same principle applies to machine learning models. Only that this time, the data is so large and complex that we humans cannot easily see the embedded patterns. For instance, the Boston dataset, which is regarded as a small dataset, has 13 features plus the target, with 506 samples. It’s almost impossible to extract any substantial pattern from the data by merely looking at the numbers.
But here’s the thing: most times, not all features directly affect the target. Features that do not meaningfully affect the labels are called noise and should be scaled down or completely removed. Selecting the most important features can greatly improve the performance of your model. Not only does it make the data more compact, it also allows the model to learn patterns more quickly during training.
Another point to note is that some features can be so highly correlated that a change in one strongly tracks a change in the other. This is called multicollinearity. If you observe this in your data, it is good practice to remove one of the features or, better still, merge the correlated features into one. While multicollinearity may not hurt your model’s predictive performance, it is good practice to check for it and deal with it so that redundant features are removed.
In this tutorial, you will learn the steps involved in data preprocessing and model building. By the end of this tutorial, you will know:
- How to get a quick overview of your data
- How to deal with missing values
- Checking for multicollinearity
- How to deal with multicollinearity
- How to inspect your data
- How to check for outliers
- How to normalize and standardize your data
- Building a neural network with Keras
- Training a neural network
- Evaluating the model
- Improving the model
These steps are the framework for building machine learning models. Let’s dive in.
Data Overview
Inspecting the data is a critical step when building a machine learning model. This is because, many times, the data has some imperfections. Moreover, you need to be familiar with the features of the data and their specific data types. A good and common practice is to check the first five rows of the data.
Let’s start by importing the necessary libraries and loading the data from sklearn.datasets. Throughout this tutorial, we will use other libraries such as matplotlib, seaborn, NumPy, and of course TensorFlow.
# import the necessary libraries
from sklearn.datasets import load_boston
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

#load the dataset
data = load_boston()

#convert the dataset into a Pandas dataframe and add the target column named 'Price'
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Price'] = data.target
We do this using the head() method of pandas, which prints the first 5 rows of the dataset. Needless to say, you need to have pandas installed on your machine. If you do not, simply type pip install pandas on your console.
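The call itself is a one-liner (df here is the dataframe we created in the snippet above):

#print the first five rows of the dataset
df.head()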
Output:
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0

   PTRATIO       B  LSTAT  Price
0     15.3  396.90   4.98   24.0
1     17.8  396.90   9.14   21.6
2     17.8  392.83   4.03   34.7
3     18.7  394.63   2.94   33.4
4     18.7  396.90   5.33   36.2
Let’s see the number of rows and columns we have in our dataset. This will help give an idea of how large the dataset is. This is done using the shape attribute of the dataframe.
#check the number of rows and columns in the dataset
df.shape
Output:
(506, 14)
To get an overview of the data, we use the describe() method. It shows the count, mean, standard deviation, minimum value, 25th percentile, median, 75th percentile, and maximum value of each column.
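No snippet was shown for this step in the original text, so a minimal call is sketched below. Judging by the layout of the output (one row per feature), the summary appears to have been transposed, so .transpose() is assumed here.

#summary statistics for each column, transposed so that each feature is a row
df.describe().transpose()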
Output:
         count        mean         std        min         25%        50%  \
CRIM     506.0    3.613524    8.601545    0.00632    0.082045    0.25651
ZN       506.0   11.363636   23.322453    0.00000    0.000000    0.00000
INDUS    506.0   11.136779    6.860353    0.46000    5.190000    9.69000
CHAS     506.0    0.069170    0.253994    0.00000    0.000000    0.00000
NOX      506.0    0.554695    0.115878    0.38500    0.449000    0.53800
RM       506.0    6.284634    0.702617    3.56100    5.885500    6.20850
AGE      506.0   68.574901   28.148861    2.90000   45.025000   77.50000
DIS      506.0    3.795043    2.105710    1.12960    2.100175    3.20745
RAD      506.0    9.549407    8.707259    1.00000    4.000000    5.00000
TAX      506.0  408.237154  168.537116  187.00000  279.000000  330.00000
PTRATIO  506.0   18.455534    2.164946   12.60000   17.400000   19.05000
B        506.0  356.674032   91.294864    0.32000  375.377500  391.44000
LSTAT    506.0   12.653063    7.141062    1.73000    6.950000   11.36000
Price    506.0   22.532806    9.197104    5.00000   17.025000   21.20000

                75%       max
CRIM       3.677083   88.9762
ZN        12.500000  100.0000
INDUS     18.100000   27.7400
CHAS       0.000000    1.0000
NOX        0.624000    0.8710
RM         6.623500    8.7800
AGE       94.075000  100.0000
DIS        5.188425   12.1265
RAD       24.000000   24.0000
TAX      666.000000  711.0000
PTRATIO   20.200000   22.0000
B        396.225000  396.9000
LSTAT     16.955000   37.9700
Price     25.000000   50.0000
You’d observe that while some columns contain fairly large numbers (e.g. the TAX column with a mean of about 408.2), others contain small numbers (e.g. the NOX column with a mean of about 0.55). Having features on such widely different scales makes it difficult for a machine learning model to learn. To fix this, the data should be rescaled through standardization or normalization. We explain what these terms mean later in this tutorial.
Dealing with Missing Values
Going forward, we check for missing values. The presence of missing values can greatly affect how the machine learning model behaves, which makes it critically important to check whether missing values exist in your data and deal with them appropriately. To check for missing values, we use the isnull() method.
#check for null values
df.isnull().sum()
Output:
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
Price      0
dtype: int64
As seen, there are no missing values in this particular dataset, which is usually not the case for untouched real-world data. Where missing values exist, you can drop the affected rows completely if they are relatively few. If, however, the missing values are many, it is not advisable to drop all rows containing them, as you’d be losing a lot of information. In such cases, you can replace missing values with an aggregate of the column such as the mean, median, or mode.
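As a hedged sketch of those two options (purely illustrative, since this dataset has no missing values; the choice of the RM column is arbitrary):

#option 1: drop rows containing missing values when they are relatively few
df_clean = df.dropna()

#option 2: replace missing values in a column with an aggregate such as the median
df['RM'] = df['RM'].fillna(df['RM'].median())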
Checking for Multicollinearity
Multicollinearity may not have a serious impact on the performance of most machine learning algorithms, but it is important to check for it to better understand your data. So let’s first discuss what multicollinearity is. Multicollinearity occurs when two or more independent variables (features) are strongly correlated. This is a big deal in linear regression problems because it makes the regression coefficients unreliable. By implication, you won’t have a clear insight into how the individual features affect the target variable.
Let’s say you have a linear regression problem given by the equation
Y = m1X1 + m2X2 + m3X3 + … + mnXn + c
If X1 and X2 are strongly correlated, a change in X1 is accompanied by a corresponding change in X2. Thus, you won’t be able to determine how X1 and X2 individually affect the target variable, Y.
Now that we have an idea of what multicollinearity is, how do we detect it?
Before dealing with multicollinearity, you must first detect it. There are a couple of ways to do this. One is to plot the correlation matrix of the data as a heat map and look for pairs of features with a strong correlation (positive or negative). Another is to calculate the VIF for each feature and check for columns with high VIF scores. In this tutorial, we will focus on the VIF method.
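For reference, a minimal version of the heat-map approach might look like the sketch below; it only uses pandas' corr() and seaborn's heatmap(), which come from libraries we have already imported.

#plot the correlation matrix as a heat map (alternative way to spot multicollinearity)
corr_matrix = df.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()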
VIF stands for Variance Inflation Factor. For a given feature, it is computed as 1 / (1 − R²), where R² is obtained by regressing that feature on all the other features.
- A VIF score equal to 1 means there is no correlation at all.
- A VIF score between 1 and 5 means there is a slight correlation.
- A VIF score greater than 10 means there is a strong correlation.
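To make the 1 / (1 − R²) formula concrete, here is a small sketch (not part of the original workflow) that computes the VIF of a single column by hand: regress that column on all the other columns, take the R² score, and invert 1 − R². It needs one extra import, scikit-learn's LinearRegression.

#illustrative sketch: compute the VIF of one column directly from its R-squared
from sklearn.linear_model import LinearRegression

def manual_vif(dataframe, column):
    #regress the chosen column on all the remaining columns
    X_others = dataframe.drop(columns=[column])
    y_col = dataframe[column]
    r_squared = LinearRegression().fit(X_others, y_col).score(X_others, y_col)
    #VIF = 1 / (1 - R^2)
    return 1.0 / (1.0 - r_squared)

print(manual_vif(df, 'NOX'))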
We calculate the VIF scores for each column using the statsmodels library. The code that calculates the VIF scores and creates a DataFrame is shown below.
def create_vif(dataframe):
    '''This function calculates the Variance Inflation Factor for each column
    and converts the results into a dataframe'''
    #create an empty dataframe
    vif_table = pd.DataFrame()
    #populate the first column with the columns of the dataset
    vif_table['variables'] = dataframe.columns
    #calculate the VIF of each column and create a VIF column to store the number
    vif_table['VIF'] = [vif(dataframe.values, i) for i in range(dataframe.shape[1])]
    return vif_table

#print the VIF table for each variable
print(create_vif(df))
Output:
   variables         VIF
0       CRIM    2.131404
1         ZN    2.910004
2      INDUS   14.485874
3       CHAS    1.176266
4        NOX   74.004269
5         RM  136.101743
6        AGE   21.398863
7        DIS   15.430455
8        RAD   15.369980
9        TAX   61.939713
10   PTRATIO   87.227233
11         B   21.351015
12     LSTAT   12.615188
13     Price   24.503206
As seen from the table, DIS, RAD, and INDUS have VIF scores of 15.43, 15.37, and 14.49 respectively. These values are greater than 10 and close to one another, which implies that these three columns are strongly correlated. So how do we deal with them?
Dealing with Multicollinearity
One option is to drop all but one of the strongly correlated columns. The idea is that the column left behind behaves much like the ones dropped and can stand in their stead. Other data scientists prefer to combine all the correlated columns into one.
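As a rough sketch of the dropping option (which column to keep is a judgment call; here DIS is kept purely for illustration, and the result is stored in a separate dataframe so it does not interfere with the PCA approach we actually use next):

#illustrative alternative: keep DIS and drop the other two strongly correlated columns
df_dropped = df.drop(['RAD', 'INDUS'], axis=1)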
Here, we will combine the correlated columns into one so that we retain as much of the behavior of the individual columns as possible. We do this with the Principal Component Analysis (PCA) transformation technique, which reduces the dimensionality of the data while preserving its most important properties. To perform the PCA transformation, we instantiate the class and then fit and transform it on the correlated columns. The code below shows this procedure.
#compress the columns 'DIS', 'RAD', 'INDUS' into 1 column
pca = PCA(n_components=1)

#call the compressed column 'new'
df['new'] = pca.fit_transform(df[['DIS', 'RAD', 'INDUS']])

#drop the three columns from the dataset
df = df.drop(['DIS', 'RAD', 'INDUS'], axis=1)
With the new dataframe, we can recheck the VIF using the function we created earlier.
Looking at the new column, you’ll see that it has a VIF of less than 10, which is good.
#recheck the new VIF table
print(create_vif(df))
Output:
   variables         VIF
0       CRIM    2.006392
1         ZN    2.349186
2       CHAS    1.173519
3        NOX   65.166302
4         RM  133.757986
5        AGE   18.823276
6        TAX   56.391909
7    PTRATIO   77.938234
8          B   21.345554
9      LSTAT   12.580803
10     Price   23.131681
11       new    9.194328
Inspecting the Data
You should also inspect your data by plotting features against one another. The Seaborn library provides an easy way to do this with its pairplot method. We select three correlated features with high VIF (NOX, RM, TAX) and two features with lower VIF (LSTAT, new).
#print a pairplot to check the relationships between strongly correlated features
pp = sns.pairplot(df[['NOX', 'RM', 'TAX', 'LSTAT', 'new']])
pp = pp.map_lower(sns.regplot)
pp = pp.map_upper(sns.kdeplot)
Output:
We can see the relationships between the features from the pairplot. For some feature pairs, the data points follow a clear pattern. This is a pointer to the fact that a linear regression model can learn the data and subsequently make predictions. You’d also notice that some data points lie far from where the majority are. Next, we discuss how to make our model robust to such data points, called outliers.
Checking for Outliers
Statistical parameters such as the mean and standard deviation, as well as machine learning algorithms such as linear regression and ANOVA, are sensitive to outliers. Ideally, the numbers in a column should follow a normal distribution curve (bell shape), where the majority of the values appear around the center. Values that are exceptionally high or low, far out in the tails of the distribution, are called outliers.
Outliers can significantly affect the training of machine learning models, often causing longer training times and reduced model accuracy.
There are various ways of detecting outliers in a dataset. For this tutorial, we’ll plot boxplots to visualize how the data points are distributed. An outlier is any point above or below the whiskers of a boxplot; such points are typically drawn as circles beyond the whiskers.
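The whiskers of a standard boxplot follow the 1.5 × IQR rule, so the same check can be expressed numerically. The sketch below counts how many points in one column fall outside that range; the choice of the RM column is only illustrative.

#count the points that fall outside the 1.5 * IQR whiskers for one column (illustrative)
q1 = df['RM'].quantile(0.25)
q3 = df['RM'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = df[(df['RM'] < lower_bound) | (df['RM'] > upper_bound)]
print('Number of RM outliers:', len(outliers))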
We would use the seaborn library to plot a boxplot for the independent variables of the dataset.
df1 = df.copy()

# Create a figure with 10 subplots with a width spacing of 1.5
fig, ax = plt.subplots(2, 5)
fig.subplots_adjust(wspace=1.5)

# Create a boxplot for the continuous features
box_plot1 = sns.boxplot(y=np.log(df1[df1.columns[0]]), ax=ax[0][0])
box_plot2 = sns.boxplot(y=np.log(df1[df1.columns[1]]), ax=ax[0][1])
box_plot3 = sns.boxplot(y=np.log(df1[df1.columns[2]]), ax=ax[0][2])
box_plot4 = sns.boxplot(y=np.log(df1[df1.columns[3]]), ax=ax[0][3])
box_plot5 = sns.boxplot(y=np.log(df1[df1.columns[4]]), ax=ax[0][4])
box_plot6 = sns.boxplot(y=np.log(df1[df1.columns[5]]), ax=ax[1][0])
box_plot7 = sns.boxplot(y=np.log(df1[df1.columns[6]]), ax=ax[1][1])
box_plot8 = sns.boxplot(y=np.log(df1[df1.columns[-3]]), ax=ax[1][2])
box_plot9 = sns.boxplot(y=np.log(df1[df1.columns[8]]), ax=ax[1][3])
box_plot10 = sns.boxplot(y=np.log(df1[df1.columns[10]]), ax=ax[1][4])
Output:
From the boxplots, you’d observe that features such as RM, AGE, PTRATIO, B, and LSTAT have outliers. So how do we deal with them? Dropping rows containing outliers is usually not the best idea; especially when the outliers are many, we’d be losing a lot of information. Instead, you can rescale your data in a way that is robust to outliers.
Data Normalization and Standardization
We can rescale the data distribution through normalization or standardization. Normalization involves rescaling your data so that the values fall within a predetermined range, commonly 0 to 1. Standardization, on the other hand, involves rescaling your data so that it has a mean of zero and a standard deviation of one, bringing the distribution closer to a standard bell curve.
Scikit-learn’s preprocessing module allows us to carry out the various standardization and normalization steps. Let’s discuss some of the options.
1. StandardScaler: This rescales the data by subtracting the mean from every entry and dividing by the standard deviation. After a StandardScaler step has been carried out, the mean of the distribution is zero and, for normally distributed data, about 68% of the values fall between -1 and 1.
2. MinMaxScaler: The MinMaxScaler subtracts the minimum value of the feature from each entry and divides by the range of the feature. The MinMaxScaler does not change the shape of the distribution; it simply squeezes the values into the 0 to 1 range.
3. RobustScaler: The RobustScaler subtracts the median from each entry and divides by the interquartile range of the feature. Because it uses the median and the interquartile range rather than the mean and standard deviation, the scaling is far less influenced by extreme values, which makes RobustScaler well suited to data with outliers. A rough sketch of this computation follows.
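The snippet below is a rough sketch of what RobustScaler computes with its default settings (center on the median, scale by the interquartile range), applied to a single illustrative column:

#roughly what RobustScaler does to one column with its default settings (illustrative)
col = df['LSTAT']
scaled_col = (col - col.median()) / (col.quantile(0.75) - col.quantile(0.25))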
Since our data contains outliers, we will rescale it using the RobustScaler class. Note that we need to split the data into train and test sets first. We also One-Hot Encode the CHAS column (a categorical feature). We then fit the RobustScaler on the train dataset but transform both the train and test datasets. The code below does all of this.
#One-Hot Encode the CHAS column
df = pd.get_dummies(df, columns=['CHAS'], drop_first=True)

#define the features and the labels, X and y
X = df.drop(['Price'], axis=1)
y = df['Price']

#split the features and labels into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

#rescale the data to be robust to outliers
scaler = RobustScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Now that we have preprocessed the data, it is time to build the neural network model using Keras.
Building a Multilayer Neural Network with TensorFlow Keras
Before training our model, we have to build it. In Keras, the architecture of a neural network is built using the Sequential class, and you can add as many layers as you like.
First off, we will create a single hidden layer and see how the model performs.
Since the data we pass into the model has 11 features, we must set the input_dim parameter of the first layer to 11. Our single hidden layer has 15 nodes, and it feeds into the output layer, which has just one node. Since this is a linear regression problem and the output is a single number, the final layer should have exactly one node.
In addition, the hidden layer uses the ReLU activation function, whereas the output layer uses a linear activation function. If you don’t know what activation functions are, think of them as functions applied to the weighted sum of a node’s inputs that determine what the node passes on to the next layer.
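As a tiny conceptual sketch (not how Keras implements them internally), the two activation functions used here can be written as plain NumPy functions applied to a node's weighted sum:

#conceptual sketch of the two activation functions used in this model
def relu(z):
    #ReLU lets positive values through and zeroes out negative ones
    return np.maximum(0, z)

def linear(z):
    #the linear activation simply passes the weighted sum through unchanged
    return z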
The code to build the neural network architecture is shown below.
#build the neural network architecture
model = Sequential()
model.add(Dense(15, input_dim=11, activation='relu'))
model.add(Dense(1, activation='linear'))
The next step is to compile the model. We use the Adam optimizer with a mean squared error loss and define the metrics to be mean squared error and mean absolute error.
model.compile(loss='mse', optimizer='adam', metrics=['mse', 'mae'])
Training the Model
The model is trained for 200 epochs with a validation set made up of 20% of the train data. The validation set helps you check how well the model is learning during the training process, based on the loss function.
#train the neural network on the train dataset
history = model.fit(X_train, y_train, epochs=200, validation_split=0.2)
Output: Train on 323 samples, validate on 81 samples Epoch 1/200 323/323 [==============================] - 0s 1ms/sample - loss: 616.0037 - mean_squared_error: 616.0037 - mean_absolute_error: 23.3245 - val_loss: 584.0988 - val_mean_squared_error: 584.0989 - val_mean_absolute_error: 22.4651 Epoch 2/200 323/323 [==============================] - 0s 127us/sample - loss: 606.8097 - mean_squared_error: 606.8097 - mean_absolute_error: 23.1052 - val_loss: 576.2635 - val_mean_squared_error: 576.2634 - val_mean_absolute_error: 22.2775 Epoch 3/200 323/323 [==============================] - 0s 161us/sample - loss: 598.1349 - mean_squared_error: 598.1349 - mean_absolute_error: 22.8789 - val_loss: 568.6242 - val_mean_squared_error: 568.6242 - val_mean_absolute_error: 22.0914 Epoch 4/200 323/323 [==============================] - 0s 248us/sample - loss: 590.0231 - mean_squared_error: 590.0231 - mean_absolute_error: 22.6751 - val_loss: 561.2776 - val_mean_squared_error: 561.2776 - val_mean_absolute_error: 21.9079 Epoch 5/200 323/323 [==============================] - 0s 161us/sample - loss: 582.1993 - mean_squared_error: 582.1993 - mean_absolute_error: 22.4697 - val_loss: 554.2171 - val_mean_squared_error: 554.2170 - val_mean_absolute_error: 21.7276 Epoch 6/200 323/323 [==============================] - 0s 198us/sample - loss: 574.5526 - mean_squared_error: 574.5526 - mean_absolute_error: 22.2655 - val_loss: 547.2002 - val_mean_squared_error: 547.2002 - val_mean_absolute_error: 21.5468 Epoch 7/200 323/323 [==============================] - 0s 248us/sample - loss: 566.7739 - mean_squared_error: 566.7739 - mean_absolute_error: 22.0529 - val_loss: 540.1250 - val_mean_squared_error: 540.1251 - val_mean_absolute_error: 21.3606 Epoch 8/200 323/323 [==============================] - 0s 111us/sample - loss: 559.2289 - mean_squared_error: 559.2289 - mean_absolute_error: 21.8367 - val_loss: 532.9769 - val_mean_squared_error: 532.9769 - val_mean_absolute_error: 21.1680 Epoch 9/200 323/323 [==============================] - 0s 111us/sample - loss: 551.4707 - mean_squared_error: 551.4707 - mean_absolute_error: 21.6204 - val_loss: 526.0247 - val_mean_squared_error: 526.0247 - val_mean_absolute_error: 20.9819 Epoch 10/200 323/323 [==============================] - 0s 149us/sample - loss: 543.9210 - mean_squared_error: 543.9210 - mean_absolute_error: 21.4173 - val_loss: 519.0010 - val_mean_squared_error: 519.0010 - val_mean_absolute_error: 20.7915 Epoch 11/200 323/323 [==============================] - 0s 124us/sample - loss: 536.3257 - mean_squared_error: 536.3257 - mean_absolute_error: 21.2125 - val_loss: 511.7967 - val_mean_squared_error: 511.7967 - val_mean_absolute_error: 20.5944 Epoch 12/200 323/323 [==============================] - 0s 136us/sample - loss: 528.6936 - mean_squared_error: 528.6937 - mean_absolute_error: 21.0106 - val_loss: 504.5885 - val_mean_squared_error: 504.5885 - val_mean_absolute_error: 20.3977 Epoch 13/200 323/323 [==============================] - 0s 124us/sample - loss: 520.8847 - mean_squared_error: 520.8847 - mean_absolute_error: 20.7995 - val_loss: 497.2613 - val_mean_squared_error: 497.2613 - val_mean_absolute_error: 20.2193 Epoch 14/200 323/323 [==============================] - 0s 124us/sample - loss: 513.0849 - mean_squared_error: 513.0849 - mean_absolute_error: 20.5858 - val_loss: 489.8176 - val_mean_squared_error: 489.8176 - val_mean_absolute_error: 20.0351 Epoch 15/200 323/323 [==============================] - 0s 124us/sample - loss: 505.3566 - mean_squared_error: 505.3567 - 
mean_absolute_error: 20.3856 - val_loss: 482.2511 - val_mean_squared_error: 482.2511 - val_mean_absolute_error: 19.8488 Epoch 16/200 323/323 [==============================] - 0s 99us/sample - loss: 497.5187 - mean_squared_error: 497.5188 - mean_absolute_error: 20.1893 - val_loss: 474.6838 - val_mean_squared_error: 474.6838 - val_mean_absolute_error: 19.6661 Epoch 17/200 323/323 [==============================] - 0s 149us/sample - loss: 489.7085 - mean_squared_error: 489.7086 - mean_absolute_error: 19.9929 - val_loss: 467.2122 - val_mean_squared_error: 467.2122 - val_mean_absolute_error: 19.4878 Epoch 18/200 323/323 [==============================] - 0s 223us/sample - loss: 482.0081 - mean_squared_error: 482.0081 - mean_absolute_error: 19.8129 - val_loss: 459.4699 - val_mean_squared_error: 459.4698 - val_mean_absolute_error: 19.3026 Epoch 19/200 323/323 [==============================] - 0s 124us/sample - loss: 474.0288 - mean_squared_error: 474.0287 - mean_absolute_error: 19.6281 - val_loss: 451.8731 - val_mean_squared_error: 451.8731 - val_mean_absolute_error: 19.1187 Epoch 20/200 323/323 [==============================] - 0s 111us/sample - loss: 466.1271 - mean_squared_error: 466.1271 - mean_absolute_error: 19.4428 - val_loss: 444.5884 - val_mean_squared_error: 444.5884 - val_mean_absolute_error: 18.9436 … Epoch 181/200 323/323 [==============================] - 0s 149us/sample - loss: 28.0329 - mean_squared_error: 28.0329 - mean_absolute_error: 3.8922 - val_loss: 29.0025 - val_mean_squared_error: 29.0025 - val_mean_absolute_error: 3.8905 Epoch 182/200 323/323 [==============================] - 0s 136us/sample - loss: 27.7569 - mean_squared_error: 27.7569 - mean_absolute_error: 3.8608 - val_loss: 28.9420 - val_mean_squared_error: 28.9420 - val_mean_absolute_error: 3.8719 Epoch 183/200 323/323 [==============================] - 0s 124us/sample - loss: 27.5550 - mean_squared_error: 27.5550 - mean_absolute_error: 3.8354 - val_loss: 28.9521 - val_mean_squared_error: 28.9521 - val_mean_absolute_error: 3.8516 Epoch 184/200 323/323 [==============================] - 0s 149us/sample - loss: 27.3054 - mean_squared_error: 27.3054 - mean_absolute_error: 3.8107 - val_loss: 28.5168 - val_mean_squared_error: 28.5168 - val_mean_absolute_error: 3.8161 Epoch 185/200 323/323 [==============================] - 0s 173us/sample - loss: 27.0219 - mean_squared_error: 27.0219 - mean_absolute_error: 3.7885 - val_loss: 28.0858 - val_mean_squared_error: 28.0858 - val_mean_absolute_error: 3.7814 Epoch 186/200 323/323 [==============================] - 0s 161us/sample - loss: 26.7649 - mean_squared_error: 26.7649 - mean_absolute_error: 3.7670 - val_loss: 27.8294 - val_mean_squared_error: 27.8294 - val_mean_absolute_error: 3.7574 Epoch 187/200 323/323 [==============================] - 0s 136us/sample - loss: 26.5128 - mean_squared_error: 26.5128 - mean_absolute_error: 3.7427 - val_loss: 27.4006 - val_mean_squared_error: 27.4006 - val_mean_absolute_error: 3.7293 Epoch 188/200 323/323 [==============================] - 0s 161us/sample - loss: 26.3242 - mean_squared_error: 26.3242 - mean_absolute_error: 3.7329 - val_loss: 27.1109 - val_mean_squared_error: 27.1109 - val_mean_absolute_error: 3.7049 Epoch 189/200 323/323 [==============================] - 0s 136us/sample - loss: 26.0745 - mean_squared_error: 26.0745 - mean_absolute_error: 3.7042 - val_loss: 27.0394 - val_mean_squared_error: 27.0394 - val_mean_absolute_error: 3.6909 Epoch 190/200 323/323 [==============================] - 0s 161us/sample - loss: 25.8574 - 
mean_squared_error: 25.8574 - mean_absolute_error: 3.6782 - val_loss: 26.9795 - val_mean_squared_error: 26.9795 - val_mean_absolute_error: 3.6774 Epoch 191/200 323/323 [==============================] - 0s 149us/sample - loss: 25.6682 - mean_squared_error: 25.6682 - mean_absolute_error: 3.6587 - val_loss: 26.8557 - val_mean_squared_error: 26.8557 - val_mean_absolute_error: 3.6599 Epoch 192/200 323/323 [==============================] - 0s 149us/sample - loss: 25.4568 - mean_squared_error: 25.4568 - mean_absolute_error: 3.6391 - val_loss: 26.5597 - val_mean_squared_error: 26.5597 - val_mean_absolute_error: 3.6302 Epoch 193/200 323/323 [==============================] - 0s 111us/sample - loss: 25.2383 - mean_squared_error: 25.2383 - mean_absolute_error: 3.6239 - val_loss: 26.2430 - val_mean_squared_error: 26.2430 - val_mean_absolute_error: 3.6019 Epoch 194/200 323/323 [==============================] - 0s 124us/sample - loss: 25.0200 - mean_squared_error: 25.0200 - mean_absolute_error: 3.6001 - val_loss: 26.2021 - val_mean_squared_error: 26.2021 - val_mean_absolute_error: 3.5890 Epoch 195/200 323/323 [==============================] - 0s 124us/sample - loss: 24.8465 - mean_squared_error: 24.8465 - mean_absolute_error: 3.5796 - val_loss: 25.9885 - val_mean_squared_error: 25.9885 - val_mean_absolute_error: 3.5653 Epoch 196/200 323/323 [==============================] - 0s 111us/sample - loss: 24.6697 - mean_squared_error: 24.6697 - mean_absolute_error: 3.5667 - val_loss: 25.7908 - val_mean_squared_error: 25.7908 - val_mean_absolute_error: 3.5423 Epoch 197/200 323/323 [==============================] - 0s 99us/sample - loss: 24.4858 - mean_squared_error: 24.4858 - mean_absolute_error: 3.5508 - val_loss: 25.7717 - val_mean_squared_error: 25.7717 - val_mean_absolute_error: 3.5298 Epoch 198/200 323/323 [==============================] - 0s 136us/sample - loss: 24.2800 - mean_squared_error: 24.2800 - mean_absolute_error: 3.5314 - val_loss: 25.8030 - val_mean_squared_error: 25.8030 - val_mean_absolute_error: 3.5115 Epoch 199/200 323/323 [==============================] - 0s 99us/sample - loss: 24.2206 - mean_squared_error: 24.2206 - mean_absolute_error: 3.5227 - val_loss: 25.5244 - val_mean_squared_error: 25.5244 - val_mean_absolute_error: 3.4847 Epoch 200/200 323/323 [==============================] - 0s 111us/sample - loss: 23.9753 - mean_squared_error: 23.9753 - mean_absolute_error: 3.5040 - val_loss: 25.1087 - val_mean_squared_error: 25.1087 - val_mean_absolute_error: 3.4590
We have successfully trained our model. You’d notice that the loss moved from about 616.0 in the first epoch to about 24.0 in the 200th epoch. This shows that the model was improving with every epoch.
To visualize the losses, we convert the history object into a dataframe and plot the training loss and the validation loss. If the gap between the two is large, it is a sign that the model is overfitting rather than genuinely learning the data.
#plot the loss and validation loss of the dataset
history_df = pd.DataFrame(history.history)
plt.plot(history_df['loss'], label='loss')
plt.plot(history_df['val_loss'], label='val_loss')
plt.legend()
Output:
In our case, the model has learned the data well since the training loss and validation loss are close. Furthermore, the loss greatly dropped in the first few epochs and stabilized at some point without going up. This is an indication that while the model has learned, it is not overfitting.
Notice that at the end of training, the loss values are as follows:
Training loss: 23.97
Mean absolute error: 3.50
Validation loss: 25.11
Validation mean absolute error: 3.46
Evaluating the Model
We can evaluate the model with the evaluate() method. This compares the model’s predictions on the test data with the true values and calculates the loss/error.
#evaluate the model
model.evaluate(X_test, y_test, batch_size=128)
Output:
102/102 [==============================] - 0s 153us/sample - loss: 22.5740 - mean_squared_error: 22.5740 - mean_absolute_error: 3.5839

[22.573974609375, 22.573975, 3.5838845]
We can have an overview of what the model predicts versus its actual value using a simple plot.
y_pred = model.predict(X_test).flatten()

a = plt.axes(aspect='equal')
plt.scatter(y_test, y_pred)
plt.xlabel('True values')
plt.ylabel('Predicted values')
plt.title('A plot that shows the true and predicted values')
plt.xlim([0, 60])
plt.ylim([0, 60])
plt.plot([0, 60], [0, 60])
Output:
From the plot, you’d see that the model’s predictions track the true values reasonably well.
We can still tweak our model to further improve its performance. There are many techniques for improving a neural network: adding more hidden layers, increasing the number of nodes in a layer, changing the activation function, adding more data, tweaking the optimizer’s parameters, and so on.
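As one example, tweaking the optimizer can be as simple as passing a configured Adam instance at compile time. The learning rate of 0.01 below is purely illustrative (we do not tune it in this tutorial), and depending on your TensorFlow version the argument may be named lr rather than learning_rate.

#illustrative only: compile with an explicitly configured Adam optimizer
model.compile(loss='mse',
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              metrics=['mse', 'mae'])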
Let’s see how adding more hidden layers will improve the model.
Improving the Model by Adding More Hidden Layers
One way of improving a neural network’s performance is to add more hidden layers. Deeper networks can capture more complex patterns, although deeper is not automatically better; it also increases training time and the risk of overfitting.
So let’s change our model by adding 2 more hidden layers: one with 7 nodes and the other with 3 nodes, both with the ReLU activation function. Just like the last model, we compile it with a mean squared error loss, the Adam optimizer, and both the mean squared error and mean absolute error metrics.
#build the neural network architecture
model = Sequential()
model.add(Dense(15, input_dim=11, activation='relu'))
model.add(Dense(7, activation='relu'))
model.add(Dense(3, activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='adam', metrics=['mse', 'mae'])

#train the neural network on the train dataset
history = model.fit(X_train, y_train, epochs=200, validation_split=0.2)
Output:
Train on 323 samples, validate on 81 samples Epoch 1/200 323/323 [==============================] - 1s 2ms/sample - loss: 584.4734 - mean_squared_error: 584.4734 - mean_absolute_error: 22.6072 - val_loss: 553.0111 - val_mean_squared_error: 553.0111 - val_mean_absolute_error: 21.6163 Epoch 2/200 323/323 [==============================] - 0s 97us/sample - loss: 575.5218 - mean_squared_error: 575.5219 - mean_absolute_error: 22.3731 - val_loss: 544.7089 - val_mean_squared_error: 544.7089 - val_mean_absolute_error: 21.4320 Epoch 3/200 323/323 [==============================] - 0s 97us/sample - loss: 565.2366 - mean_squared_error: 565.2367 - mean_absolute_error: 22.1050 - val_loss: 535.9432 - val_mean_squared_error: 535.9432 - val_mean_absolute_error: 21.2384 Epoch 4/200 323/323 [==============================] - 0s 145us/sample - loss: 554.2672 - mean_squared_error: 554.2672 - mean_absolute_error: 21.8140 - val_loss: 525.9688 - val_mean_squared_error: 525.9689 - val_mean_absolute_error: 21.0172 Epoch 5/200 323/323 [==============================] - 0s 145us/sample - loss: 541.8079 - mean_squared_error: 541.8079 - mean_absolute_error: 21.4882 - val_loss: 514.2750 - val_mean_squared_error: 514.2750 - val_mean_absolute_error: 20.7664 Epoch 6/200 323/323 [==============================] - 0s 145us/sample - loss: 527.6235 - mean_squared_error: 527.6235 - mean_absolute_error: 21.1237 - val_loss: 500.1802 - val_mean_squared_error: 500.1802 - val_mean_absolute_error: 20.4756 Epoch 7/200 323/323 [==============================] - 0s 145us/sample - loss: 510.5902 - mean_squared_error: 510.5903 - mean_absolute_error: 20.7072 - val_loss: 483.6809 - val_mean_squared_error: 483.6808 - val_mean_absolute_error: 20.1316 Epoch 8/200 323/323 [==============================] - 0s 145us/sample - loss: 490.7871 - mean_squared_error: 490.7871 - mean_absolute_error: 20.2235 - val_loss: 463.2415 - val_mean_squared_error: 463.2415 - val_mean_absolute_error: 19.7122 Epoch 9/200 323/323 [==============================] - 0s 145us/sample - loss: 465.3827 - mean_squared_error: 465.3828 - mean_absolute_error: 19.6485 - val_loss: 439.1797 - val_mean_squared_error: 439.1796 - val_mean_absolute_error: 19.2099 Epoch 10/200 323/323 [==============================] - 0s 145us/sample - loss: 436.7313 - mean_squared_error: 436.7312 - mean_absolute_error: 18.9943 - val_loss: 410.8449 - val_mean_squared_error: 410.8448 - val_mean_absolute_error: 18.5876 Epoch 11/200 323/323 [==============================] - 0s 145us/sample - loss: 404.6039 - mean_squared_error: 404.6039 - mean_absolute_error: 18.2399 - val_loss: 379.6046 - val_mean_squared_error: 379.6046 - val_mean_absolute_error: 17.8701 Epoch 12/200 323/323 [==============================] - 0s 145us/sample - loss: 369.8315 - mean_squared_error: 369.8315 - mean_absolute_error: 17.4045 - val_loss: 346.7320 - val_mean_squared_error: 346.7320 - val_mean_absolute_error: 17.0592 Epoch 13/200 323/323 [==============================] - 0s 145us/sample - loss: 332.8788 - mean_squared_error: 332.8788 - mean_absolute_error: 16.4958 - val_loss: 314.1923 - val_mean_squared_error: 314.1923 - val_mean_absolute_error: 16.2052 Epoch 14/200 323/323 [==============================] - 0s 145us/sample - loss: 298.7931 - mean_squared_error: 298.7931 - mean_absolute_error: 15.5864 - val_loss: 281.9098 - val_mean_squared_error: 281.9098 - val_mean_absolute_error: 15.3273 Epoch 15/200 323/323 [==============================] - 0s 145us/sample - loss: 265.7078 - mean_squared_error: 265.7079 - 
mean_absolute_error: 14.5916 - val_loss: 253.8650 - val_mean_squared_error: 253.8650 - val_mean_absolute_error: 14.4485 Epoch 16/200 323/323 [==============================] - 0s 97us/sample - loss: 237.9645 - mean_squared_error: 237.9644 - mean_absolute_error: 13.6058 - val_loss: 230.3261 - val_mean_squared_error: 230.3261 - val_mean_absolute_error: 13.6310 Epoch 17/200 323/323 [==============================] - 0s 97us/sample - loss: 213.5237 - mean_squared_error: 213.5237 - mean_absolute_error: 12.7039 - val_loss: 210.8874 - val_mean_squared_error: 210.8874 - val_mean_absolute_error: 13.0260 Epoch 18/200 323/323 [==============================] - 0s 97us/sample - loss: 193.0863 - mean_squared_error: 193.0863 - mean_absolute_error: 11.8859 - val_loss: 194.1782 - val_mean_squared_error: 194.1782 - val_mean_absolute_error: 12.4450 Epoch 19/200 323/323 [==============================] - 0s 97us/sample - loss: 176.8083 - mean_squared_error: 176.8083 - mean_absolute_error: 11.3360 - val_loss: 180.8584 - val_mean_squared_error: 180.8584 - val_mean_absolute_error: 11.8897 Epoch 20/200 323/323 [==============================] - 0s 145us/sample - loss: 164.0756 - mean_squared_error: 164.0756 - mean_absolute_error: 10.8522 - val_loss: 171.2320 - val_mean_squared_error: 171.2320 - val_mean_absolute_error: 11.4639 … Epoch 181/200 323/323 [==============================] - 0s 97us/sample - loss: 12.1372 - mean_squared_error: 12.1372 - mean_absolute_error: 2.3793 - val_loss: 15.9544 - val_mean_squared_error: 15.9544 - val_mean_absolute_error: 2.3558 Epoch 182/200 323/323 [==============================] - 0s 145us/sample - loss: 12.0800 - mean_squared_error: 12.0800 - mean_absolute_error: 2.3553 - val_loss: 15.8774 - val_mean_squared_error: 15.8774 - val_mean_absolute_error: 2.3423 Epoch 183/200 323/323 [==============================] - 0s 97us/sample - loss: 12.0202 - mean_squared_error: 12.0202 - mean_absolute_error: 2.3414 - val_loss: 15.7801 - val_mean_squared_error: 15.7801 - val_mean_absolute_error: 2.3369 Epoch 184/200 323/323 [==============================] - 0s 145us/sample - loss: 11.9876 - mean_squared_error: 11.9876 - mean_absolute_error: 2.3502 - val_loss: 15.7188 - val_mean_squared_error: 15.7188 - val_mean_absolute_error: 2.3659 Epoch 185/200 323/323 [==============================] - 0s 242us/sample - loss: 11.9647 - mean_squared_error: 11.9647 - mean_absolute_error: 2.3655 - val_loss: 15.8191 - val_mean_squared_error: 15.8191 - val_mean_absolute_error: 2.4131 Epoch 186/200 323/323 [==============================] - 0s 145us/sample - loss: 12.0691 - mean_squared_error: 12.0691 - mean_absolute_error: 2.4635 - val_loss: 16.2266 - val_mean_squared_error: 16.2266 - val_mean_absolute_error: 2.6174 Epoch 187/200 323/323 [==============================] - 0s 145us/sample - loss: 12.1569 - mean_squared_error: 12.1569 - mean_absolute_error: 2.4610 - val_loss: 15.6773 - val_mean_squared_error: 15.6773 - val_mean_absolute_error: 2.4570 Epoch 188/200 323/323 [==============================] - 0s 145us/sample - loss: 11.8792 - mean_squared_error: 11.8792 - mean_absolute_error: 2.3678 - val_loss: 15.7074 - val_mean_squared_error: 15.7074 - val_mean_absolute_error: 2.3972 Epoch 189/200 323/323 [==============================] - 0s 145us/sample - loss: 11.9190 - mean_squared_error: 11.9190 - mean_absolute_error: 2.3617 - val_loss: 15.8393 - val_mean_squared_error: 15.8393 - val_mean_absolute_error: 2.3808 Epoch 190/200 323/323 [==============================] - 0s 145us/sample - loss: 12.8232 - 
mean_squared_error: 12.8232 - mean_absolute_error: 2.5695 - val_loss: 16.4048 - val_mean_squared_error: 16.4048 - val_mean_absolute_error: 2.6977 Epoch 191/200 323/323 [==============================] - 0s 145us/sample - loss: 12.0817 - mean_squared_error: 12.0817 - mean_absolute_error: 2.4824 - val_loss: 15.5024 - val_mean_squared_error: 15.5024 - val_mean_absolute_error: 2.4516 Epoch 192/200 323/323 [==============================] - 0s 145us/sample - loss: 11.8084 - mean_squared_error: 11.8084 - mean_absolute_error: 2.3831 - val_loss: 15.4221 - val_mean_squared_error: 15.4221 - val_mean_absolute_error: 2.4194 Epoch 193/200 323/323 [==============================] - 0s 97us/sample - loss: 11.7507 - mean_squared_error: 11.7507 - mean_absolute_error: 2.3955 - val_loss: 15.4557 - val_mean_squared_error: 15.4557 - val_mean_absolute_error: 2.4357 Epoch 194/200 323/323 [==============================] - 0s 145us/sample - loss: 11.6437 - mean_squared_error: 11.6437 - mean_absolute_error: 2.3657 - val_loss: 15.3709 - val_mean_squared_error: 15.3709 - val_mean_absolute_error: 2.3435 Epoch 195/200 323/323 [==============================] - 0s 145us/sample - loss: 11.6290 - mean_squared_error: 11.6290 - mean_absolute_error: 2.3445 - val_loss: 15.3940 - val_mean_squared_error: 15.3940 - val_mean_absolute_error: 2.3470 Epoch 196/200 323/323 [==============================] - 0s 97us/sample - loss: 11.6334 - mean_squared_error: 11.6334 - mean_absolute_error: 2.3860 - val_loss: 15.4824 - val_mean_squared_error: 15.4824 - val_mean_absolute_error: 2.3938 Epoch 197/200 323/323 [==============================] - 0s 145us/sample - loss: 11.6110 - mean_squared_error: 11.6110 - mean_absolute_error: 2.3495 - val_loss: 15.5030 - val_mean_squared_error: 15.5030 - val_mean_absolute_error: 2.2746 Epoch 198/200 323/323 [==============================] - 0s 145us/sample - loss: 11.8521 - mean_squared_error: 11.8521 - mean_absolute_error: 2.3540 - val_loss: 15.2363 - val_mean_squared_error: 15.2363 - val_mean_absolute_error: 2.3209 Epoch 199/200 323/323 [==============================] - 0s 145us/sample - loss: 11.5532 - mean_squared_error: 11.5532 - mean_absolute_error: 2.3486 - val_loss: 15.3506 - val_mean_squared_error: 15.3506 - val_mean_absolute_error: 2.3752 Epoch 200/200 323/323 [==============================] - 0s 97us/sample - loss: 11.4892 - mean_squared_error: 11.4892 - mean_absolute_error: 2.3523 - val_loss: 15.3902 - val_mean_squared_error: 15.3902 - val_mean_absolute_error: 2.3758
Let’s see the losses in contrast.
Training loss: 11.49
Mean absolute error: 2.35
Validation loss: 15.39
Validation mean absolute error: 2.38
Notice that the loss is now 11.49 even with the same number of epochs. We can again plot the graph to show both the training loss and validation loss.
#plot the loss and validation loss of the dataset
history_df = pd.DataFrame(history.history)
plt.plot(history_df['loss'], label='loss')
plt.plot(history_df['val_loss'], label='val_loss')
plt.legend()
Output:
As seen in the figure above, the model’s loss stabilizes after about the first 50 epochs. This is an improvement, as the previous model took roughly 175 epochs to approach its minimum.
We can also evaluate this model to determine how accurate it is.
#evaluate the model
model.evaluate(X_test, y_test, batch_size=128)
Output:
102/102 [==============================] - 0s 0s/sample - loss: 13.0725 - mean_squared_error: 13.0725 - mean_absolute_error: 2.7085
Finally, we will visualize the prediction with a simple plot.
y_pred = model.predict(X_test).flatten()

a = plt.axes(aspect='equal')
plt.scatter(y_test, y_pred)
plt.xlabel('True values')
plt.ylabel('Predicted values')
plt.title('A plot that shows the true and predicted values')
plt.xlim([0, 60])
plt.ylim([0, 60])
plt.plot([0, 60], [0, 60])
Output:
Notice that this time, the data points are more tightly clustered around the straight line, which indicates that our model is performing better.
Conclusion
In this tutorial, you have learned the step-by-step approach to data preprocessing and building a linear regression model. We saw how to build a neural network using Keras in TensorFlow and went a step further to improve the model by increasing the number of hidden layers.