Description
About The Data
Our goal for this lab is to construct a model that can take a set of housing features and give us
back a price estimate. Since price is a continuous variable, linear regression is a good place to
start.
The dataset that we’ll be using for this task comes from kaggle.com and contains the following
attributes:
'Avg. Area Income': Avg. income of residents of the city the house is located in.
'Avg. Area House Age': Avg. age of houses in the same city.
'Avg. Area Number of Rooms': Avg. number of rooms for houses in the same city.
'Avg. Area Number of Bedrooms': Avg. number of bedrooms for houses in the same city.
'Area Population': Population of the city the house is located in.
'Price': Price that the house sold at (target).
'Address': Address of the house.
Exploratory Data Analysis
Let’s begin by importing some necessary libraries that we’ll be using to explore the data.
Our first step is to load the data into a pandas DataFrame.
   Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  Avg. Area Number of Bedrooms  Area Population         Price                                            Address
0      79545.458574             5.682861                   7.009188                          4.09     23086.800503  1.059034e+06  208 Michael Ferry Apt. 674\nLaurabury, NE 3701...
1      79248.642455             6.002900                   6.730821                          3.09     40173.072174  1.505891e+06  188 Johnson Views Suite 079\nLake Kathleen, CA...
2      61287.067179             5.865890                   8.512727                          5.13     36882.159400  1.058988e+06  9127 Elizabeth Stravenue\nDanieltown, WI 06482...
3      63345.240046             7.188236                   5.586729                          3.26     34310.242831  1.260617e+06                          USS Barnett\nFPO AP 44820
4      59982.197226             5.040555                   7.839388                          4.23     26354.109472  6.309435e+05                         USNS Raymond\nFPO AE 09386
From here, it's always a good idea to use describe() and info() to get a better sense of the data and
see if we have any missing values.
       Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  Avg. Area Number of Bedrooms  Area Population         Price
count       5000.000000          5000.000000                5000.000000                   5000.000000      5000.000000  5.000000e+03
mean       68583.108984             5.977222                   6.987792                      3.981330     36163.516039  1.232073e+06
std        10657.991214             0.991456                   1.005833                      1.234137      9925.650114  3.531176e+05
min        17796.631190             2.644304                   3.236194                      2.000000       172.610686  1.593866e+04
25%        61480.562388             5.322283                   6.299250                      3.140000     29403.928702  9.975771e+05
50%        68804.286404             5.970429                   7.002902                      4.050000     36199.406689  1.232669e+06
75%        75783.338666             6.650808                   7.665871                      4.490000     42861.290769  1.471210e+06
max       107701.748378             9.519088                  10.759588                      6.500000     69621.713378  2.469066e+06
The info below lets us know that we have 5,000 entries and 5,000 non‑null values in each
feature/column. Therefore, there are no missing values in this dataset.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Avg. Area Income              5000 non-null   float64
 1   Avg. Area House Age           5000 non-null   float64
 2   Avg. Area Number of Rooms     5000 non-null   float64
 3   Avg. Area Number of Bedrooms  5000 non-null   float64
 4   Area Population               5000 non-null   float64
 5   Price                         5000 non-null   float64
 6   Address                       5000 non-null   object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB
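If you'd rather not scan the info() table, isnull().sum() answers the missing-value question directly; a quick sketch:

# Count missing values per column; every count should be 0 for this dataset.
print(housing_data.isnull().sum())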
A quick pairplot lets us get an idea of the distributions and relationships in our dataset. From here, we
could choose any interesting features that we’d like to later explore in greater depth. Warning: The
more features in our dataset, the harder our pairplot will be to interpret.
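If the full grid gets too crowded, one workaround is to pairplot only a few columns at a time; a minimal sketch (this particular column subset is just an example):

# Pairplot restricted to three columns to keep the grid readable.
subset = ['Avg. Area Income', 'Avg. Area House Age', 'Price']
sns.pairplot(housing_data[subset])
plt.show()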
Taking a closer look at Price, we see that it's approximately normally distributed with a peak around the mean of 1.232073e+06,
and that 75% of houses sold for 1.471210e+06 or less.
count    5.000000e+03
mean     1.232073e+06
std      3.531176e+05
min      1.593866e+04
25%      9.975771e+05
50%      1.232669e+06
75%      1.471210e+06
max      2.469066e+06
Name: Price, dtype: float64
A scatterplot of Price vs. Avg. Area Income shows a strong positive linear relationship between the
two.
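To put a number on that relationship, we can compute the Pearson correlation between the two columns; a quick sketch:

# Pearson correlation: values near +1 indicate a strong positive linear relationship.
r = housing_data['Price'].corr(housing_data['Avg. Area Income'])
print('Pearson r:', round(r, 3))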
Creating a boxplot of Avg. Area Number of Bedrooms lets us see that the median average area number
of bedrooms is around 4, with a minimum of 2 and a maximum of around 6.5. We can also see that there are no
outliers present.
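We can back up that "no outliers" observation numerically with the same 1.5 x IQR fence rule that boxplots use; a sketch:

# Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the boxplot whisker fences.
beds = housing_data['Avg. Area Number of Bedrooms']
q1, q3 = beds.quantile(0.25), beds.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = beds[(beds < lower) | (beds > upper)]
print(len(outliers), 'outliers outside', (lower, upper))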
Try plotting some of the other features for yourself to see if you can discover some interesting
findings. Refer back to the matplotlib lab if you’re having trouble creating any graphs.
Another important thing to look for while we're exploring our data is multicollinearity. Multicollinearity
means that several variables are essentially measuring the same thing. Not only is there no point to
having more than one measure of the same thing in a model, but doing so can actually cause our
model results to fluctuate. Luckily, checking for multicollinearity can be done easily with the help of a
heatmap. Note: Depending on the situation, it may not be a problem for your model if only slight
or moderate collinearity issues occur. However, it is strongly advised to address the issue if severe
collinearity exists (e.g., a correlation > 0.8 between two variables).
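If you'd rather not eyeball the heatmap, the same check can be scripted using the 0.8 threshold from the note above; a sketch (numeric_only requires pandas 1.5+):

# Print every feature pair whose absolute correlation exceeds 0.8.
corr = housing_data.corr(numeric_only=True).abs()
cols = corr.columns
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if corr.loc[a, b] > 0.8:
            print('Severe collinearity:', a, 'vs', b, round(corr.loc[a, b], 2))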
This dataset is quite clean, so there are no severe collinearity issues. We'll later dive into some
messier datasets that will require some type of feature engineering or PCA to resolve.
Creating Our Linear Model
We're now ready to begin creating and training our model. We first need to split our data into training
and testing sets. This can be done using sklearn's train_test_split(X, y, test_size) function. This
function takes in your features (X), the target variable (y), and the test_size you'd like (generally, a test
size of around 0.3 is good enough). It then returns X_train, X_test, y_train, and y_test for
us. We will train our model on the training set and then use the test set to evaluate the model.
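Note that the split is random, so your exact numbers will differ from run to run; passing random_state pins the shuffle. A minimal sketch, assuming the X and y built in cell In [11] below (the value 42 is arbitrary):

# random_state makes the split, and therefore the metrics below, reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)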
We’ll now import sklearn’s LinearRegression model and begin training it using the fit(train_data,
train_data_labels) method. In a nutshell, fitting is equal to training. Then, after it is trained, the model
can be used to make predictions, usually with a predict(test_data) method call. You can think of fit as
the step that finds the coefficients for the equation.
LinearRegression()
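(The LinearRegression() line above is just the fitted model's printed representation.) Putting the pieces together, here is a minimal sketch of the whole fit-then-predict flow; the same calls appear in cells In [12] and In [13] at the end of this notebook:

# fit() estimates the intercept and coefficients; predict() applies them to new data.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
print(lm.intercept_, lm.coef_)    # the fitted equation's parameters
predictions = lm.predict(X_test)  # apply the equation to unseen data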
Model Evaluation
Now that we've finished training, we can make predictions from the test data and evaluate our
model's performance using the corresponding test data labels (y_test).
To get a rough idea of how well the model is predicting, we can make a scatterplot with the true test
labels (y_test) on the x-axis and our predictions on the y-axis. Ideally, the points should fall along a
45-degree line; the more tightly they cluster around that line, the better our predictions are.
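To make that 45-degree target explicit, you can overlay an identity line on the scatterplot; a minimal sketch, assuming the y_test and predictions from the cells below:

# True vs. predicted prices; perfect predictions would fall on the dashed line.
plt.scatter(y_test, predictions, alpha=0.5)
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, 'r--')
plt.xlabel('True Price (y_test)')
plt.ylabel('Predicted Price')
plt.show()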
Something that you may recall from MATH 3339 is that we'd like the residuals to be normally
distributed in regression analysis. We can examine this as follows:
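The histogram of residuals in cell In [14] below is one way to check; a Q-Q plot is another standard normality check. A sketch using scipy (an assumption on our part; scipy isn't used elsewhere in this lab):

# Q-Q plot: normally distributed residuals should hug the reference line.
from scipy import stats
stats.probplot(residuals, dist='norm', plot=plt)
plt.show()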
Here are the most common evaluation metrics for regression problems:
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert$$
Mean Squared Error (MSE) is the mean of the squared errors:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
Comparing these metrics:
MAE is the easiest to understand, because it’s the average error.
MSE is more popular than MAE, because MSE “punishes” larger errors, which tends to be useful in
the real world.
RMSE is even more popular than MSE, because RMSE is interpretable in the “y” units.
All of these are loss functions, because we want to minimize them.
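To connect the formulas above to sklearn's output below, all three metrics are one-liners in NumPy; a quick sketch, assuming the y_test and predictions defined in the cells at the end:

# Hand-rolled MAE/MSE/RMSE; these should match sklearn's output exactly.
errors = y_test - predictions
print('MAE: ', np.mean(np.abs(errors)))
print('MSE: ', np.mean(errors ** 2))
print('RMSE:', np.sqrt(np.mean(errors ** 2)))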
Luckily, sklearn can calculate all of these metrics for us. All we need to do is pass the true labels
(y_test) and our predictions to the functions below. What’s more important is that we understand what
each of these means. Root Mean Square Error (RMSE) is what we’ll most commonly use, which is the
standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the
regression line data points are; RMSE is a measure of how spread out these residuals are. In other
words, it tells us how concentrated the data is around the line of best fit. Determining a good RMSE
depends on your data. You can find a great example here, or refer back to the PowerPoint slides.
MAE: 83410.59496702514
MSE: 10608579825.136667
RMSE: 102997.96029600133
Something we also like to look at is the coefficient of determination (R²), which is the percentage of
variation in y explained by all the x variables together. Usually an R² of 0.70 or higher is considered good.
R2 Score: 0.9150208174786678
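Since we're using five predictors, you might also compute adjusted R², which penalizes adding variables that don't pull their weight; a sketch (this isn't computed in the original notebook):

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
from sklearn.metrics import r2_score
n, p = X_test.shape  # n samples, p predictors
r2 = r2_score(y_test, predictions)
print('Adjusted R2:', 1 - (1 - r2) * (n - 1) / (n - p - 1))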
Finally, let’s see how we can interpret our model’s coefficients. We can access the coefficients by
calling coef_ on our linear model (lm in this case). We’ll use this and put it in a nice pandas DataFrame
for visual purposes. Note: You can also call intercept_ if you’d like to get the intercept.
Coefficient
Avg. Area Income 21.564645
Avg. Area House Age 166102.423648
Avg. Area Number of Rooms 122398.915857
Avg. Area Number of Bedrooms 887.665746
Area Population 15.309706
What these coefficients mean:
Holding all other features fixed, a 1 unit increase in Avg. Area Income is associated with a
$21.564645 increase in Price.
Holding all other features fixed, a 1 unit increase in Avg. Area House Age is associated with a
$166102.423648 increase in Price.
Holding all other features fixed, a 1 unit increase in Avg. Area Number of Rooms is associated
with a $122398.915857 increase in Price.
Holding all other features fixed, a 1 unit increase in Avg. Area Number of Bedrooms is associated
with a $887.665746 increase in Price.
Holding all other features fixed, a 1 unit increase in Area Population is associated with a
$15.309706 increase in Price.
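To see these coefficients in action, you can price a hypothetical house; the feature values below are made up purely for illustration (roughly the column means):

# Predict one hypothetical house; column order must match the training features.
new_house = pd.DataFrame([[68000, 6.0, 7.0, 4.0, 36000]], columns=X.columns)
print('Predicted price:', lm.predict(new_house)[0])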
Congratulations! You now know how to create and evaluate linear models using sklearn. As extra
practice, I'd recommend finding a used car or similar housing dataset on kaggle.com and
using this notebook as a guide.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
from matplotlib import rcParams
rcParams['figure.figsize'] = 15, 5
sns.set_style('darkgrid')
In [3]:
housing_data = pd.read_csv('USA_Housing.csv')
housing_data.head()
Out[3]:
In [4]:
housing_data.describe()
Out[4]:
In [5]:
housing_data.info()
In [6]:
sns.pairplot(housing_data)
plt.show()
In [7]:
sns.histplot(housing_data['Price'])
plt.show()
print(housing_data['Price'].describe())
In [8]:
sns.scatterplot(x='Price', y='Avg. Area Income', data=housing_data)
plt.show()
In [9]:
sns.boxplot(x='Avg. Area Number of Bedrooms', data=housing_data)
plt.show()
In [10]:
sns.heatmap(housing_data.corr(numeric_only=True), annot=True)  # numeric_only skips the non-numeric Address column
plt.show()
In [11]:
from sklearn.model_selection import train_test_split
X = housing_data[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
                  'Avg. Area Number of Bedrooms', 'Area Population']]
y = housing_data['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
In [12]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
Out[12]:
In [13]:
predictions = lm.predict(X_test)
plt.scatter(y_test, predictions)
plt.show()
In [14]:
residuals = y_test - predictions
sns.histplot(residuals)
plt.show()
In [15]:
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
In [16]:
from sklearn.metrics import r2_score
print('R2 Score:', r2_score(y_test, predictions))
In [17]:
coeff_df = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])
coeff_df
Out[17]: