Machine Learning

Diving into Linear Regression Models

If we have determined that our problem calls for a regression approach to build a predictive model, we generally start with a linear model. Why? Linear models are easy to interpret, they train quickly, and their optimization is straightforward.

A linear model fits a linear function to the data. We are going to explore four different linear models to solve our problem. These are popular Scikit-Learn linear models:

  1. Linear Regression
  2. Ridge Regression
  3. Lasso Regression
  4. Elastic-Net Regression

Data set

Scikit-Learn ships with some quite popular datasets.

Here we will use two of them separately: the diabetes data and the Boston house price data.

The diabetes data has 10 features and a target representing the level of disease progression. The following code imports the diabetes dataset.

#import pandas to generate data frame
import pandas as pd

#Import Data set
from sklearn.datasets import load_diabetes
diabetes, target = load_diabetes(return_X_y=True)

# Prepare data for modeling
diabetes = pd.DataFrame(diabetes)

#Import train_test_split to separate training and validation sets
from sklearn.model_selection import train_test_split

# Separate input features and target
yd = target
Xd = diabetes

# set up training and test sets, holding out 25% of the data for testing
Xd_train, Xd_test, yd_train, yd_test = train_test_split(Xd, yd, test_size=0.25)

The Boston house price data has 13 features representing factors that can affect a house price, and a target variable representing the median house price. It can be imported with the following code. (Note that recent scikit-learn releases no longer ship load_boston, so this example assumes an older version.)

#import pandas to generate data frame
import pandas as pd

#Import Data set
from sklearn.datasets import load_boston
boston_feature, target = load_boston(return_X_y=True)

# Prepare data for modeling
boston_feature = pd.DataFrame(boston_feature)

#Import train_test_split to separate training and validation sets
from sklearn.model_selection import train_test_split

# Separate input features and target
yb = target
Xb = boston_feature

# set up training and test sets, holding out 25% of the data for testing
Xb_train, Xb_test, yb_train, yb_test = train_test_split(Xb, yb, test_size=0.25)

“Life is ten percent what you experience and ninety percent how you respond to it.”

Dorothy M. Neddermeyer

How do we evaluate our model?

Ans. The R² method.

R-Square, also called the coefficient of determination, measures how much of the variance in the target variable the model explains. For example, if a model scores 0.82, it accounts for 82% of the variability in the data when predicting, so a higher value indicates a more predictive model. Its maximum is 1; on held-out data it can even drop below 0 when the model performs worse than simply predicting the mean.

Generally a higher value indicates a good model. Still, if the value is too high it might be an indication of overfitting. Overfitting means the model performs very well on the training data but is unable to generalize to unseen data, often because it has effectively memorized the training set.
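
To make the metric concrete, here is a minimal sketch (not part of the original example) of how R-Square can be computed by hand with NumPy; the y_true and y_pred arrays are made up purely for illustration.

import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot

# predictions close to the actual values give a score near 1
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.7])
print(r_squared(y_true, y_pred))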

We will use Scikit-Learn's cross-validation via the cross_val_score function to validate our model.

Linear Regression Using the Least Squares Method

Linear regression finds the parameters that minimize the mean squared error, which is the sum of squared differences between the predicted and actual values divided by the number of samples.
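
As a quick illustration of that definition (on made-up numbers, not on our datasets), the snippet below computes the mean squared error by hand and checks it against scikit-learn's mean_squared_error helper.

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

# sum of squared differences between predicted and real values,
# divided by the number of samples
mse_manual = np.sum((y_true - y_pred) ** 2) / len(y_true)

print(mse_manual)                          # 0.025
print(mean_squared_error(y_true, y_pred))  # same value via scikit-learn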

For linear regression we use the LinearRegression class from Scikit-Learn.

Code for training on the diabetes dataset and the corresponding validation score:

from sklearn.model_selection import cross_val_score
import numpy as np

from sklearn.linear_model import LinearRegression

# Train model
lrd = LinearRegression().fit(Xd_train, yd_train)

# get cross val scores
scores_d = cross_val_score(lrd,Xd_train,yd_train, cv=5, scoring='r2')
print('CV Mean: ', np.mean(scores_d))
print('STD: ', np.std(scores_d))
print('\n')

#[Out]
#CV Mean:  0.459121847533
#STD:  0.108027388584

Code for training on the Boston dataset and the corresponding validation score:

from sklearn.model_selection import cross_val_score
import numpy as np

from sklearn.linear_model import LinearRegression

# Train model
lrb = LinearRegression().fit(Xb_train, yb_train)

# get cross val scores
scores_b = cross_val_score(lrb,Xb_train,yb_train, cv=5, scoring='r2')
print('CV Mean: ', np.mean(scores_b))
print('STD: ', np.std(scores_b))
print('\n')

#[Out]
#CV Mean:  0.743791700087
#STD:  0.0397478457728

For the diabetes dataset the R² value is 0.45 with a standard deviation of 0.10. The low R² shows the model is quite inaccurate, and the relatively high standard deviation across folds suggests the model is overfitted, i.e. it predicts better on the data it has already seen than on unseen data.

We can mitigate the overfitting problem by simplifying the model using regularization. Let us now explore some variations through the following regularization techniques.

Ridge Regularization or L2 Regularization

Ridge regularization, or L2 regularization, shrinks the magnitude of the coefficients to reduce the complexity of the model. The regularization parameter ⍺ controls how strongly the model is constrained.

Using a higher value of ⍺ pushes the coefficients towards zero, which places more restrictions on the model. This can decrease training performance but improves the model's ability to generalize. Too high a value may lead to an overly simple, underfitted model.

⍺ too high: training performance decreases, more restriction on coefficients, underfit.
⍺ too low: training performance increases, less restriction on coefficients, overfit.

With a lower value of ⍺ the coefficients are restricted only slightly, and with a value that is too low the model becomes almost identical to the plain linear regression above, again risking overfitting.
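
As a rough sketch of this effect (reusing the Xd_train and yd_train sets prepared above, with alpha values chosen purely for illustration), we can watch the overall coefficient magnitude shrink as alpha grows:

import numpy as np
from sklearn.linear_model import Ridge

# larger alpha should give a smaller total coefficient magnitude
for a in [0.01, 1, 100]:
    r = Ridge(alpha=a).fit(Xd_train, yd_train)
    print(f'alpha={a}: sum of |coefficients| = {np.sum(np.abs(r.coef_)):.2f}')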

We will use Scikit-Learn's Ridge class for this regularization.

from sklearn.linear_model import Ridge

from sklearn.model_selection import cross_val_score
import numpy as np

# Train model with default alpha=1
ridge = Ridge(alpha=1).fit(Xd_train, yd_train)

# get cross val scores
scores_dr = cross_val_score(ridge,Xd_train,yd_train, cv=5, scoring='r2')
print('CV Mean: ', np.mean(scores_dr))
print('STD: ', np.std(scores_dr))
print('\n')

#[Out]
#CV Mean:  0.374873023827
#STD:  0.0239975009806

Here we get an R² of 0.37, i.e. the Ridge regression model explains about 37% of the variance, which shows the model is not very accurate. On the other hand, the standard deviation is 0.02, which is low enough to suggest the model is not overfitted.

We used the default value of alpha, but the best value varies with the dataset. Let's tune alpha and check whether we can improve performance. We will use grid search to find the optimal alpha value.

#import GridSearchCV to optimize Hyper Parameter
from sklearn.model_selection import GridSearchCV

# find optimal alpha with grid search
alpha = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
param_grid = dict(alpha=alpha)
grid = GridSearchCV(estimator=ridge, param_grid=param_grid, 
                    scoring='r2', verbose=1, n_jobs=-1)

grid_result = grid.fit(Xd_train, yd_train)

print('Best Score: ', grid_result.best_score_)
print('Best Params: ', grid_result.best_params_)

#[OUT]
#Fitting 3 folds for each of 7 candidates, totalling 21 fits
#Best Score:  0.467058428738
#Best Params:  {'alpha': 0.1}

Our R² value increased, but 0.46 is still not a number to rely upon. Let's see if we can optimize further by using another type of regression.

Lasso Regression or L1 Regularization

Lasso regression forces some coefficients to be exactly zero, i.e. some features are completely ignored by the model. In effect it automatically searches for the most important features. If we have a large number of features, Lasso can be a good approach for revealing the most important ones.

⍺ too high: number of features decreases, more coefficients forced to 0, underfit.
⍺ too low: number of features increases, fewer coefficients forced to 0, overfit.

A higher value of ⍺ forces more coefficients to zero, which may cause underfitting, and vice versa.
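
As a small sketch of this behaviour (again reusing Xd_train and yd_train from above, with illustrative alpha values), we can count how many features Lasso keeps at each setting:

import numpy as np
from sklearn.linear_model import Lasso

for a in [0.01, 0.1, 1]:
    l = Lasso(alpha=a).fit(Xd_train, yd_train)
    kept = np.sum(l.coef_ != 0)  # coefficients not forced to exactly zero
    print(f'alpha={a}: {kept} of {Xd_train.shape[1]} features kept')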

Let's use Lasso regression on the diabetes dataset.

from sklearn.linear_model import Lasso

# Train model with default alpha=1
lasso = Lasso(alpha=1).fit(Xd_train, yd_train)

# get cross val scores
scores_dl = cross_val_score(lasso,Xd_train,yd_train, cv=5, scoring='r2')
print('CV Mean: ', np.mean(scores_dl))
print('STD: ', np.std(scores_dl))
print('\n')


#[OUT]
#CV Mean:  0.31169352912
#STD:  0.0331502459335

Here we used the default value of alpha, resulting in an R-square of 0.31, which is not very good. As before, we can use grid search to explore other possibilities.

# find optimal alpha with grid search
alpha = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

param_grid = dict(alpha=alpha)
grid = GridSearchCV(estimator=lasso, param_grid=param_grid, 
                    scoring='r2', verbose=1, n_jobs=-1)

grid_result = grid.fit(Xd_train, yd_train)

print('Best Score: ', grid_result.best_score_)
print('Best Params: ', grid_result.best_params_)

#[out]
#Best Score:  0.466329253999
#Best Params:  {'alpha': 0.01}

Did our score improve? We can now examine whether any of the coefficients have been turned into exactly 0.

for coef, col in enumerate(Xd_train.columns):
    print(f'{col}:  {lasso.coef_[coef]}')
    
#[out]
#0:  0.0
#1:  -0.0
#2:  352.84390566637677
#3:  24.52184907058875
#4:  0.0
#5:  0.0
#6:  -0.0
#7:  0.0
#8:  277.38032896613436
#9:  0.0    

We can see that 7 out of 10 coefficients have been set to exactly 0, meaning the model ignores those features.
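
One caveat: the loop above inspects the Lasso object fitted earlier with the default alpha=1, not the grid-searched model. To look at the coefficients of the tuned model instead, one option (a sketch, assuming grid_result from the previous step) is to use the refitted best estimator:

# GridSearchCV refits the best model on the full training set by default
best_lasso = grid_result.best_estimator_
for coef, col in enumerate(Xd_train.columns):
    print(f'{col}:  {best_lasso.coef_[coef]}')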

Now we will look at one more type of regression, Elastic-Net regression, to check whether we can make a further improvement.

Elastic-Net Regression

Elastic-Net regression combines the penalties of L1 and L2 regularization. The l1_ratio parameter controls the mix of the two: when l1_ratio = 0 the penalty is purely L2, when l1_ratio = 1 it is purely L1, and values between 0 and 1 give a combination of both.
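
For reference, the combined penalty has roughly the following shape in scikit-learn's formulation; this is only a sketch to show how l1_ratio mixes the two terms, with the coefficient vector w made up for illustration.

import numpy as np

def elastic_net_penalty(w, alpha, l1_ratio):
    l1 = np.sum(np.abs(w))   # Lasso-style (L1) penalty
    l2 = np.sum(w ** 2)      # Ridge-style (L2) penalty
    return alpha * (l1_ratio * l1 + 0.5 * (1 - l1_ratio) * l2)

# l1_ratio=1 keeps only the L1 term, l1_ratio=0 only the L2 term
print(elastic_net_penalty(np.array([1.0, -2.0, 0.5]), alpha=1, l1_ratio=0.5))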

Let's fit Elastic-Net with its default parameters.

from sklearn.linear_model import ElasticNet

# Train model with default alpha=1 and l1_ratio=0.5
elastic_net = ElasticNet(alpha=1, l1_ratio=0.5).fit(Xd_train, yd_train)
# get cross val scores
scores_de = cross_val_score(elastic_net,Xd_train,yd_train, cv=5, scoring='r2')
print('CV Mean: ', np.mean(scores_de))
print('STD: ', np.std(scores_de))
print('\n')

#[out]
#CV Mean:  -0.0216950409719
#STD:  0.0393959507574

Let's use grid search to find the optimal values of l1_ratio and alpha.

alpha = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
l1_ratio = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]

param_grid = dict(alpha=alpha, l1_ratio=l1_ratio)
grid = GridSearchCV(estimator=elastic_net, param_grid=param_grid, scoring='r2', verbose=1, n_jobs=-1)

grid_result = grid.fit(Xd_train, yd_train)

print('Best Score: ', grid_result.best_score_)
print('Best Params: ', grid_result.best_params_)

#[OUT]
#Best Score:  0.468818741654
#Best Params:  {'alpha': 0.001, 'l1_ratio': 0.9}

To conclude, we simplified our model using regularization, but unfortunately our R-Square value did not improve by much.

All of this was done on the diabetes dataset. We also prepared a second dataset, the Boston house price data. With plain linear regression alone, the Boston data gives an R-Square of 0.74 with a standard deviation of 0.03, which is much better. You can try regularization on it to improve the score further, for example as sketched below.
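
As a possible starting point, a sketch of a Ridge grid search on the Boston training set (reusing Xb_train and yb_train from above) could look like this:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

param_grid = dict(alpha=[0.001, 0.01, 0.1, 1, 10, 100, 1000])
grid_b = GridSearchCV(estimator=Ridge(), param_grid=param_grid,
                      scoring='r2', n_jobs=-1)
grid_b_result = grid_b.fit(Xb_train, yb_train)

print('Best Score: ', grid_b_result.best_score_)
print('Best Params: ', grid_b_result.best_params_)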

In an upcoming article we will explore more ways to improve the model.

Happy Coding……
