Description

5/5 - (4 votes)

Regularization
Regularization is an important concept that is used to avoid overfitting of the data, especially when the trained and test data are
varying much.
Regularization is implemented by adding a “penalty” term to the best fit derived from the trained data, to achieve a lesser variance
with the tested data and also restricts the influence of predictor variables over the output variable by compressing their
coefficients.
In regularization, what we do is normally we keep the same number of features but reduce the magnitude of the coefficients. We
can reduce the magnitude of the coefficients by using different types of regression techniques which uses regularization to
overcome this problem. So, let us discuss them.
About The Data
For this lab, we’ll be revisiting the lab 4 housing dataset and comparing the below regression techniques:
Linear Regression
Ridge Regression
Lasso Regression
The dataset that we’ll be using for this task comes from kaggle.com and contains the following attributes:
‘Avg. Area Income’: Avg. income of residents of the city house is located in.
‘Avg. Area House Age’: Avg age of houses in same city
‘Avg. Area Number of Rooms’: Avg number of rooms for houses in same city
‘Avg. Area Number of Bedrooms’: Avg number of bedrooms for houses in same city
‘Area Population’: Population of city house is located in
‘Price’: Price that the house sold at (target)
‘Address’: Address for the house
Exploratory Data Analysis
Let’s begin by importing some necessary libraries that we’ll be using to explore the data.
Our first step is to load the data into a pandas DataFrame
Avg. Area
Income
Avg. Area
House Age
Avg. Area Number
of Rooms
Avg. Area Number of
Bedrooms
Area
Population
Price Address
0 79545.458574 5.682861 7.009188 4.09 23086.800503 1.059034e+06
208 Michael Ferry Apt.
674\nLaurabury, NE
3701…
1 79248.642455 6.002900 6.730821 3.09 40173.072174 1.505891e+06 188 Johnson Views Suite
079\nLake Kathleen, CA…
2 61287.067179 5.865890 8.512727 5.13 36882.159400 1.058988e+06
9127 Elizabeth
Stravenue\nDanieltown,
WI 06482…
3 63345.240046 7.188236 5.586729 3.26 34310.242831 1.260617e+06 USS Barnett\nFPO AP
44820
4 59982.197226 5.040555 7.839388 4.23 26354.109472 6.309435e+05 USNS Raymond\nFPO AE
09386
From here, it’s always a good step to use describe() and info() to get a better sense of the data and see if we have any missing
values.
Avg. Area
Income
Avg. Area House
Age
Avg. Area Number of
Rooms
Avg. Area Number of
Bedrooms
Area
Population
Price
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5.000000e+03
mean 68583.108984 5.977222 6.987792 3.981330 36163.516039 1.232073e+06
std 10657.991214 0.991456 1.005833 1.234137 9925.650114 3.531176e+05
min 17796.631190 2.644304 3.236194 2.000000 172.610686 1.593866e+04
25% 61480.562388 5.322283 6.299250 3.140000 29403.928702 9.975771e+05
50% 68804.286404 5.970429 7.002902 4.050000 36199.406689 1.232669e+06
75% 75783.338666 6.650808 7.665871 4.490000 42861.290769 1.471210e+06
max 107701.748378 9.519088 10.759588 6.500000 69621.713378 2.469066e+06
The info below lets us know that we have 5,000 entries and 5,000 non‑null values in each feature/column. Therefore, there are no
missing values in this dataset.
<class ’pandas.core.frame.DataFrame’>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
#   Column                        NonNull Count  Dtype
                            
0   Avg. Area Income              5000 nonnull   float64
1   Avg. Area House Age           5000 nonnull   float64
2   Avg. Area Number of Rooms     5000 nonnull   float64
3   Avg. Area Number of Bedrooms  5000 nonnull   float64
4   Area Population               5000 nonnull   float64
5   Price                         5000 nonnull   float64
6   Address                       5000 nonnull   object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB
A quick pairplot lets us get an idea of the distributions and relationships in our dataset. From here, we could choose any
interesting features that we’d like to later explore in greater depth. Warning: The more features in our dataset, the harder our
pairplot will be to interpret.
Taking a closer look at price, we see that it’s normally distributed with a peak around 1.232073e+06, and 75% of houses sold were
at a price of 1.471210e+06 or lower.
count    5.000000e+03
mean     1.232073e+06
std      3.531176e+05
min      1.593866e+04
25%      9.975771e+05
50%      1.232669e+06
75%      1.471210e+06
max      2.469066e+06
Name: Price, dtype: float64
A scatterplot of Price vs. Avg. Area Income shows a strong positive linear relationship between the two.
Creating a boxplot of Avg. Area Number of Bedrooms lets us see that the median average area number of bedrooms is around 4,
with a minimum of 2 and max of around 6.5. We can also so that there are no outliers present.
Try plotting some of the other features for yourself to see if you can discover some interesting findings. Refer back to the
matplotlib lab if you’re having trouble creating any graphs.
Another important thing to look for while we’re exploring our data is multicollinearity. Multicollinearity means that several variables
are essentially measuring the same thing. Not only is there no point to having more than one measure of the same thing in a
model, but doing so can actually cause our model results to fluctuate. Luckily, checking for multicollinearity can be done easily
with the help of a heatmap. Note: Depending on the situation, it may not be a problem for your model if only slight or
moderate collinearity issue occur. However, it is strongly advised to solve the issue if severe collinearity issue exists(e.g.
correlation >0.8 between 2 variables)
No severe collinearity issues.
Train Test Split
We’re now ready to begin creating and training our model. We first need to split our data into training and testing sets. This can be
done using sklearn’s train_test_split(X, y, test_size) function. This function takes in your features (X), the target variable (y), and
the test_size you’d like (Generally a test size of around 0.3 is good enough). It will then return a tuple of X_train, X_test, y_train,
y_test sets for us. We will train our model on the training set and then use the test set to evaluate the model.
Metrics
For the following models, we’ll take a look at some of the following metrics:
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
Mean Squared Error (MSE) is the mean of the squared errors:
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
Comparing these metrics:
MAE is the easiest to understand, because it’s the average error.
MSE is more popular than MAE, because MSE “punishes” larger errors, which tends to be useful in the real world.
RMSE is even more popular than MSE, because RMSE is interpretable in the “y” units.
All of these are loss functions, because we want to minimize them.
Luckily, sklearn can calculate all of these metrics for us. All we need to do is pass the true labels (y_test) and our predictions to
the functions below. What’s more important is that we understand what each of these means. Root Mean Square Error (RMSE) is
what we’ll most commonly use, which is the standard deviation of the residuals (prediction errors). Residuals are a measure of
how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells
us how concentrated the data is around the line of best fit. Determining a good RMSE depends on your data. You can find a great
example here, or refer back to the power points.
Something we also like to look at is the coefficient of determination ( ), which is the percentage of variation in y explained by all
the x variables together. Usually an of .70 is considered good.
Linear Regression
We’ll now import sklearn’s LinearRegression model and begin training it using the fit(train_data, train_data_labels) method. In a
nutshell, fitting is equal to training. Then, after it is trained, the model can be used to make predictions, usually with a
predict(test_data) method call. You can think of fit as the step that finds the coefficients for the equation.
MAE: 81651.4761859681
MSE: 10236567587.580854
RMSE: 101175.92395219751
R2 Score:  0.9184450322767024
Ridge Regression
Ridge Regression is a technique for analyzing multiple regression data that suffers from multicollinearity. When multicollinearity
occurs, least squares estimates are unbiased, but their variances are large so they may be far from the true value. By adding a
degree of bias to the regression estimates, ridge regression reduces the standard errors. It is hoped that the net effect will be to
give estimates that are more reliable.
MAE: 83710.98823583909
MSE: 10808189204.007122
RMSE: 103962.44131419348
R2 Score:  0.913821510832677
Lasso Regression
The “LASSO” stands for Least Absolute Shrinkage and Selection Operator. This model uses shrinkage. Shrinkage is where data
values are shrunk towards a central point as the mean. The lasso procedure encourages simple, sparse models (i.e. models with
fewer parameters). This particular type of regression is well‑suited for models showing high levels of multicollinearity or when you
want to automate certain parts of model selection, like variable selection/parameter elimination.
The key difference to remember here is that Lasso shrinks the less important feature’s coefficient to zero, thus removing some
feature altogether. So, this works well for feature selection in case we have a huge number of features.
MAE: 82878.98244128452
MSE: 10568241154.446312
RMSE: 102801.95112178713
R2 Score:  0.9157347231200905
Congrats! you now know how to create ridge and lasso models using sklearn and different available metrics. However, it’s
more important that you know when it’s appropriate to use these models. For more detail, please refer back to the lecture video
and or slides.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
from matplotlib import rcParams
rcParams[‘figure.figsize’] = 15, 5
sns.set_style(‘darkgrid’)
In [3]:
housing_data = pd.read_csv(‘USA_Housing.csv’)
housing_data.head()
Out[3]: In [4]:
housing_data.describe()
Out[4]: In [5]:
housing_data.info()
In [6]: sns.pairplot(housing_data)
plt.show()
In [7]: sns.histplot(housing_data[‘Price’])
plt.show()
print(housing_data[‘Price’].describe())
In [8]: sns.scatterplot(x=’Price’, y=’Avg. Area Income’, data=housing_data)
plt.show()
In [9]: sns.boxplot(x=’Avg. Area Number of Bedrooms’, data=housing_data)
plt.show()
In [10]: sns.heatmap(housing_data.corr(), annot=True)
plt.show()
In [28]:
from sklearn.model_selection import train_test_split
X = housing_data[[‘Avg. Area Income’, ‘Avg. Area House Age’, ‘Avg. Area Number of Rooms’,
‘Avg. Area Number of Bedrooms’, ‘Area Population’]]
y = housing_data[‘Price’]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
n∑
i=1
|yi − y^i
|
1
n
n∑
i=1
(yi − y^i
)
2 1
n

⎷
n∑
i=1
(yi − y^i
)
2 1
n
R2
R2
In [15]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
# pass our X_test data through the model ﴾lm﴿ to get our predictions
predictions = lm.predict(X_test)
In [16]:
from sklearn import metrics
from sklearn.metrics import r2_score
# printing metrics
print(‘MAE:’, metrics.mean_absolute_error(y_test, predictions))
print(‘MSE:’, metrics.mean_squared_error(y_test, predictions))
print(‘RMSE:’, np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print(‘R2 Score: ’, r2_score(y_test, predictions))
In [29]:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha = 0.05, normalize = True)
ridge.fit(X_train, y_train)
predictions_ridge = ridge.predict(X_test)
# printing metrics
print(‘MAE:’, metrics.mean_absolute_error(y_test, predictions_ridge))
print(‘MSE:’, metrics.mean_squared_error(y_test, predictions_ridge))
print(‘RMSE:’, np.sqrt(metrics.mean_squared_error(y_test, predictions_ridge)))
print(‘R2 Score: ’, r2_score(y_test, predictions_ridge))
In [30]:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha = 0.05, normalize = True)
lasso.fit(X_train, y_train)
predictions_lasso = lasso.predict(X_test)
# printing metrics
print(‘MAE:’, metrics.mean_absolute_error(y_test, predictions_lasso))
print(‘MSE:’, metrics.mean_squared_error(y_test, predictions_lasso))
print(‘RMSE:’, np.sqrt(metrics.mean_squared_error(y_test, predictions_lasso)))
print(‘R2 Score: ’, r2_score(y_test, predictions_lasso))
COSC 3337

COSC 3337 Week 8 Lab (Ridge and Lasso) solved

Download Details:

Description

COSC 3337 Week 8 Lab (Ridge and Lasso) solved

Download Details:

Description

Related products

COSC 3337 Week 2 Lab (Intro to pandas) solved

COSC 3337 Week 7 Lab (KNN) and (Naive Bayes) solved

COSC 3337 Week 12 Lab (DBSCAN) solved