Sale!

COSC 3337 Week 7 Lab (KNN) and (Naive Bayes) solved

$30.00 $25.50

Category:

Description

5/5 - (3 votes)

About The Data
In this lab you will learn how to use sklearn to build a machine learning model using k‑Nearest Neighbors algorithm to predict
whether the patients in the “Pima Indians Diabetes Dataset” have diabetes or not.
The dataset that we’ll be using for this task comes from kaggle.com and contains the following attributes:
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2‑Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age (in years)
Outcome: Class variable (0 or 1)
Exploratory Data Analysis
Let’s begin by importing some necessary libraries that we’ll be using to explore the data.
Our first step is to load the data into a pandas DataFrame
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
From here, it’s always a good step to use describe() and info() to get a better sense of the data and see if we have any missing
values.
<class
’pandas.core.frame.DataFrame’>
RangeIndex:
768
entries,
0
to
767
Data
columns
(total
9
columns):
#


Column



















Non­Null
Count

Dtype
­­­

­­­­­­



















­­­­­­­­­­­­­­

­­­­­
0


Pregnancies














768
non­null



int64
1


Glucose


















768
non­null



int64
2


BloodPressure












768
non­null



int64
3


SkinThickness












768
non­null



int64
4


Insulin


















768
non­null



int64
5


BMI






















768
non­null



float64
6


DiabetesPedigreeFunction

768
non­null



float64
7


Age






















768
non­null



int64
8


Outcome


















768
non­null



int64
dtypes:
float64(2),
int64(7)
memory
usage:
54.1
KB
Looking at the info summary, we can see that there are 768 entries in the DataFrame, and 768 non‑null entries in each
feature/colum. Thus, there are no missing values, but there is something strange when we look at the describe summary below.
For certain columns below, does a value of zero make sense? For example, if an individual had a glucose or blood pressure level of
0, they’d probably be dead, so it’s likely that the true values were excluded from the data for some reason.
Therefore, we’ll consider the following columns to have missing values where there’s an invalid zero value:
Glucose
BloodPressure
SkinThickness
Insulin
BMI
Let’s go ahead and replace out invalid zero values with nan, since they technically missing values. We’ll go ahead and make a copy
of our diabetes_df and modify the zeros in the copy just incase we need to refer back to the original. We can make copies of
DataFrames using .copy(deep=True). There’s also a very convenient function we can call .replace(x, y) that will replace all x
values with the y value specified.
Before choosing how to impute these missing values, let’s take a look at their distributions.
Since SkinThickness, Insulin, and BMI look skewed, we’ll go ahead and replace their missing values with median instead of mean.
Glucose and BloodPressure should be ok if we stick with mean for imputing. Recall that mean can be effected by outliers.
Let’s first create a heatmap and see if there are any correlations in our dataset.
Interpretation: No significant case of multi collinearity is observed.
Let’s also check out a few scatterplots of our data.
Interpretation:
BMI seems to have a slight increase as blood pressure increases. However, majority of the data seems to be centered and
cluster at around a blood pressure of 50‑95 and BMI of 20‑45. We’ve also got some outliars scattered around the main
cluster.
There’s a very subtle increase in diabetes pedigree function as glucose increases. Majority of the data tends to fall between a
75 and 175 glucose level. We also have some outliars with very high diabetes pedigree function and again the zeros outliars
which were removed in the no_zeros_df
Note: Don’t worry if you can’t replicate the plot to the right. You should have learned about QQ plots in math 3339. In case anyone
needs these type of plots or a certain statistical test with p‑values for their project, statsmodels is a great place to find these.
Shapiro­Wilk:
w:0.969902515411377,
p­value:1.7774986343921384e­11
Kolmogorov­Smirnov:
d:0.969902515411377,
p­value:0.0
Skewness
of
the
data:
0.531677628850459
Interpretation:
The distribution of glucose is unimodal, and appears to be roughly bell shaped, but it’s certainly not a near perfect normal
distribution. The provided Q‑Q plot, Shapiro‑Wilk, and Kolmogorov‑Smirnov tests seem to reject the null hypothesis of the
data being a normal distribution at the .05 significance level. We can also see both by the graph and provided skewness score
(should be about zero for normally distributed data) below that the data has a slight right skew. The distribution peaks at
around 120 with most of the data between 100 and 140.
How does the glucose distribution of people with diabetes vary from those without?
Interpretation:
Majority of people in class 0 lie between 93 and 125, where as majority of people in class 1 lie between 119 and 167. With that
said, this attribute could serve as a good indicator to determine whether somone is diabetic or not since those in class 1 tend
to be in the higher end compared to class 0.
I encourage you to go ahead and explore the dataset some more to see if you can find some more interesting points, but I’ll jump
to the pre‑processing now since the main goal of this lab is KNN.
Pre‑Processing
The most important step here is to standardize our data. Because the KNN classifier predicts the class of a given test observation
by identifying the observations that are nearest to it, the scale of the variables matters. If this is not taken into account, any
variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN
classifier, than variables that are on a small scale.
If you recall from math 3339, data Z is rescaled such that and , and is done through this formula:
But lucky for us sklearn can do all of this for us.
Taking a look at the data again, we see that it is now scaled.
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 0.639947 0.865108 ‑0.033518 0.670643 ‑0.181541 0.166619 0.468492 1.425995
1 ‑0.844885 ‑1.206162 ‑0.529859 ‑0.012301 ‑0.181541 ‑0.852200 ‑0.365061 ‑0.190672
2 1.233880 2.015813 ‑0.695306 ‑0.012301 ‑0.181541 ‑1.332500 0.604397 ‑0.105584
3 ‑0.844885 ‑1.074652 ‑0.529859 ‑0.695245 ‑0.540642 ‑0.633881 ‑0.920763 ‑1.041549
4 ‑1.141852 0.503458 ‑2.680669 0.670643 0.316566 1.549303 5.484909 ‑0.020496
Creating our Model
We’re now ready to begin creating and training our model. We first need to split our data into training and testing sets. This can be
done using sklearn’s train_test_split(X, y, test_size) function. This function takes in your features (X), the target variable (y), and
the test_size you’d like (Generally a test size of around 0.3 is good enough). It will then return a tuple of X_train, X_test, y_train,
y_test sets for us. We will train our model on the training set and then use the test set to evaluate the model.
The above graph shows that the data is biased towards datapoints having outcome value as 0 (diabetes was not present actually).
The number of non‑diabetics is almost twice the number of diabetic patients. This is where an additional parameter stratify can
come in handy. Stratified sampling aims at splitting a data set so that each split is similar with respect to something. In a
classification setting, it is often chosen to ensure that the train and test sets have approximately the same percentage of samples
of each target class as the complete set.
Recall from lecture, KNN requires us to find some optimal k value. How we’ll do this is by plotting different k values on the x axis,
and the model score for that k value on the y‑axis.
Note: You can also plot the error on the y‑axis, which is quite common as well.
The best result seems to be captured at k = 11 thus 11 will be used for the final model. At this value our train and test scores don’t
vary significantly.
0.7532467532467533
Note: You should also take into account cross validation when considering different models. A separate exercise however
will be created covering different cross validation techniques.
Not bad, but could be better. See if you can mess with the data and imporve on this score.
Lastly, let’s just print out a confusion matrix and classification report of our results.
precision



recall

f1­score


support
0






0.79





0.85





0.82






150
1






0.67





0.58





0.62







81
accuracy


























0.75






231
macro
avg






0.73





0.71





0.72






231
weighted
avg






0.75





0.75





0.75






231
[[127

23]
[
34

47]]
Great job! You now know how to use KNeighborsClassifier in sklearn. Try using this on your own dataset and refer back to this
lecture if you get stuck.
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 76
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000
In
[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In
[2]:
from matplotlib import rcParams
rcParams[‘figure.figsize’] = 15, 5
sns.set_style(‘darkgrid’)
In
[3]:
diabetes_df = pd.read_csv(‘diabetes.csv’)
diabetes_df.head()
Out[3]: In
[4]:
diabetes_df.info()
In
[5]:
diabetes_df.describe()
Out[5]: In
[6]:
diabetes_df_copy = diabetes_df.copy(deep=True)
diabetes_df_copy[‘Glucose’] = diabetes_df_copy[‘Glucose’].replace(0,np.NaN)
diabetes_df_copy[‘BloodPressure’] = diabetes_df_copy[‘BloodPressure’].replace(0,np.NaN)
diabetes_df_copy[‘SkinThickness’] = diabetes_df_copy[‘SkinThickness’].replace(0,np.NaN)
diabetes_df_copy[‘Insulin’] = diabetes_df_copy[‘Insulin’].replace(0,np.NaN)
diabetes_df_copy[‘BMI’] = diabetes_df_copy[‘BMI’].replace(0,np.NaN)
In
[7]:
diabetes_df_copy[[‘Glucose’, ‘BloodPressure’, ‘SkinThickness’, ‘Insulin’, ‘BMI’]].hist(figsize = (20,10))
plt.show()
In
[8]:
diabetes_df_copy[‘Glucose’].fillna(diabetes_df_copy[‘Glucose’].mean(), inplace = True)
diabetes_df_copy[‘BloodPressure’].fillna(diabetes_df_copy[‘BloodPressure’].mean(), inplace = True)
diabetes_df_copy[‘SkinThickness’].fillna(diabetes_df_copy[‘SkinThickness’].median(), inplace = True)
diabetes_df_copy[‘Insulin’].fillna(diabetes_df_copy[‘Insulin’].median(), inplace = True)
diabetes_df_copy[‘BMI’].fillna(diabetes_df_copy[‘BMI’].median(), inplace = True)
In
[9]: sns.heatmap(diabetes_df_copy.corr(), annot=True)
plt.title(‘Correlation
Matrix’)
plt.show()
In
[10]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
#
alpha
parameter
adjusts
the
point
transparency.
points
with
much
more
overlap
will
appear
darker.
sns.scatterplot(x=’BloodPressure’, y=’BMI’, data=diabetes_df_copy, alpha=0.3, ax=axes[0])
axes[0].set_title(‘BloodPressure
VS.
BMI’)
sns.scatterplot(x=’Glucose’, y=’DiabetesPedigreeFunction’, data=diabetes_df_copy, alpha=0.3, ax=axes[1])
axes[1].set_title(‘Glucose
VS.
DPF’)
plt.show()
In
[11]:
import statsmodels.api as sm
import scipy
import pylab
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
sns.histplot(diabetes_df_copy[‘Glucose’], ax=axes[0])
axes[0].set_title(‘Glucose
Distribution’)
sm.qqplot(diabetes_df_copy[‘Glucose’], line=’s’, ax=axes[1])
axes[1].set_title(‘Glucose
Q­Q
Plot’)
pylab.show()
w, p_val = scipy.stats.shapiro(diabetes_df_copy[‘Glucose’])
print(‘Shapiro­Wilk:
\nw:{},
p­value:{}\n’.format(w,p_val))
d, p_val = scipy.stats.kstest(diabetes_df_copy[‘Glucose’], ‘norm’)
print(‘Kolmogorov­Smirnov:
\nd:{},
p­value:{}\n’.format(w,p_val))
print(‘Skewness
of
the
data:
\n{}\n’.format(scipy.stats.skew(diabetes_df_copy[‘Glucose’])))
In
[12]:
class_zero = diabetes_df_copy[(diabetes_df_copy[‘Outcome’] == 0)]
class_one = diabetes_df_copy[(diabetes_df_copy[‘Outcome’] == 1)]
plt.hist(x=class_zero[‘Glucose’], label=’class
0′, alpha=0.5)
plt.hist(x=class_one[‘Glucose’], label=’class
1′, alpha=0.5)
plt.legend()
plt.title(‘Glucose
Distribution’)
plt.show()
In
[13]:
from sklearn.preprocessing import StandardScaler
#
all
columns
except
’Outcome’
X = diabetes_df_copy.drop(‘Outcome’, axis=1)
y = diabetes_df_copy[‘Outcome’]
#
create
our
scaler
object
scaler = StandardScaler()
#
use
our
scaler
object
to
transform/scale
our
data
and
save
it
into
X_scaled
X_scaled = scaler.fit_transform(X)
#
reassign
X
to
a
new
DataFrame
using
the
X_scaled
values.
X = pd.DataFrame(data=X_scaled, columns=X.columns)
In
[14]:
X.head()
Out[14]: In
[15]: sns.countplot(x=diabetes_df_copy[‘Outcome’])
plt.show()
In
[16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
In
[17]:
from sklearn.neighbors import KNeighborsClassifier
#
will
append
scores
here
for
plotting
later
test_scores = []
train_scores = []
#
testing
k
values
from
1­14
for i in range(1,15):
#
create
a
model
with
k=i
knn = KNeighborsClassifier(i)
#
train
the
model
knn.fit(X_train,y_train)

#
append
scores.
train_scores.append(knn.score(X_train,y_train))
test_scores.append(knn.score(X_test,y_test))
In
[20]: sns.lineplot(x=range(1,15), y=train_scores, marker=’*’, label=’Train
Score’)
sns.lineplot(x=range(1,15), y=test_scores, marker=’o’, label=’Test
Score’)
plt.title(‘K
vs.
Score’)
plt.xlabel(‘K’)
plt.ylabel(‘Score’)
plt.show()
In
[21]:
knn = KNeighborsClassifier(11)
knn.fit(X_train,y_train)
knn.score(X_test,y_test)
Out[21]: In
[22]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
y_pred = knn.predict(X_test)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))

Week 7 Lab (Naive Bayes)
COSC 3337
About The Data
We’ll be using the Adult Dataset from kaggle for this lab, but feel free to follow along with your own dataset. The dataset contains
the following attributes:
age
workclass
fnlwgt
education
education_num
marital_status
occupation
relationship
race
sex
capital_gain
capital_loss
hours_per_week
native_country
income
Our goal is to predict whether income exceeds $50k/yr based on census data
Exploratory Data Analysis
Let’s begin by importing some necessary libraries that we’ll be using to explore the data.
Our first step is to load the data into a pandas DataFrame. For some reason, this dataset did not come with a header/column
names, so we will specify that when loading the data and manually add the column names ourselves.
Calling .info() we can see that there are no missing values in our dataset since there are 32561 entries in total, and 32561 non‑
null entries in every column.
<class
’pandas.core.frame.DataFrame’>
RangeIndex:
32561
entries,
0
to
32560
Data
columns
(total
15
columns):
#


Column









Non­Null
Count

Dtype
­­­

­­­­­­









­­­­­­­­­­­­­­

­­­­­
0


age












32561
non­null

int64
1


workclass






32561
non­null

object
2


fnlwgt









32561
non­null

int64
3


education






32561
non­null

object
4


education_num


32561
non­null

int64
5


marital_status

32561
non­null

object
6


occupation





32561
non­null

object
7


relationship



32561
non­null

object
8


race











32561
non­null

object
9


sex












32561
non­null

object
10

capital_gain



32561
non­null

int64
11

capital_loss



32561
non­null

int64
12

hours_per_week

32561
non­null

int64
13

native_country

32561
non­null

object
14

income









32561
non­null

object
dtypes:
int64(6),
object(9)
memory
usage:
3.7+
MB
When working with a lot of variables, it’s usually a good idea to keep track of your categorical and numerical columns in a
separate array so that way we can easilly index our dataframe by that array if for some reason we only want to work with the
numerical columns. For example, when calculating correlations we only want to work with the numerical columns else we will get
an error.
Now we can easily explore just categorical or numericals at a time. Let’s begin exploring the categorical variables first.
workclass education marital_status occupation relationship race sex native_country income
0 State‑gov Bachelors Never‑married Adm‑clerical Not‑in‑family White Male United‑States <=50K
1 Self‑emp‑not‑inc Bachelors Married‑civ‑spouse Exec‑managerial Husband White Male United‑States <=50K
2 Private HS‑grad Divorced Handlers‑cleaners Not‑in‑family White Male United‑States <=50K
3 Private 11th Married‑civ‑spouse Handlers‑cleaners Husband Black Male United‑States <=50K
4 Private Bachelors Married‑civ‑spouse Prof‑specialty Wife Black Female Cuba <=50K
Does one sex tend to earn more than the other in this dataset?
Interpretation: majority of our dataset consist of people earning <=50k, but we can see that in both categories (<=50k and >50k),
majority of the men earn more.
What’s the most common education people in our dataset have?
Interpretation: high school, some college, and bachelors degrees seem to be most common in our dataset.
Let’s see how many counts of each race we have in this dataset
Interpretation: our dataset mostly consists of people from the white race category. Thus, inferences based on race from this
dataset could be biased since we do not have enough data from other race categories.
What sort of occupations do we have in our dataset, and which are most common?
Interpretation: Prof‑specialty, craft‑repair, and Exec‑managerial are the top 3 occupations in our dataset. Also, there’s a ‘?’
signifying unknown. We’ll have to make sure to replace those question marks with null/nan values since these should really be
missing values. If you take a look, you’ll see that workclass and native_country also have ‘?’ values, so we’ll replace those with
NaN as follows:
Note: There was a small space infront of the question mark, so make sure to include that if you’re using the same dataset.
After running the cell above, we can see that we have the following missing values, which we’ll have to take care of.
Let’s now briefly explore the numerical variables
age fnlwgt education_num capital_gain capital_loss hours_per_week
0 39 77516 13 2174 0 40
1 50 83311 13 0 0 13
2 38 215646 9 0 0 40
3 53 234721 7 0 0 40
4 28 338409 13 0 0 40
Let’s check if there are any ‘?’ missing values in any of the numerical columns like we had in the categoricals. We can do this by
looping through every variable in the numericals list and printing a note if that column contains a ‘ ?’.
Great, there are no missing vlaues to take care of here, we’ll just have to take care of the categorical missing values in later.
What do the distributions of our numerical variables look like?
I encourage you to go ahead and explore the dataset some more to see if you can find some more interesting points, but I’ll jump
to the pre‑processing now since you should be comfortable exploring datasets by now, and the main goal of this lab is to learn
how to create and evaluate a Naive Bayes model in sklearn.
Pre‑Processing
We’ll first take care of the missing categorical values. One option is to replace the missing values with the most frequent/mode,
which we’ll do below. However, options for dealing with missing categorical variables include:
Remove observations with missing values if we are dealing with a large dataset and the number of records containing missing
values are few.
Remove the variable/column if it is not significant.
Develop a model to predict missing values. KNN for example.
Replace missing values with the most frequent in that column.
Our next step is to encode these categories. Since our categories don’t really have any type of order to preserve, we’ll use one hot
encoding / get dummies. Refer back to lab 5 if you’re having trouble using dummy variables, but we’ll encode as follows:
Let’s now map all of our variables onto the same scale. We’ll follow the same steps as the KNN lab. The only difference from KNN
lab is that here we’re using RobustScaler, which just scales features using statistics that are robust to outliers.
Creating our Model
We’re now ready to begin creating and training our model. We first need to split our data into training and testing sets. This can be
done using sklearn’s train_test_split(X, y, test_size) function. This function takes in your features (X), the target variable (y), and
the test_size you’d like (Generally a test size of around 0.3 is good enough). It will then return a tuple of X_train, X_test, y_train,
y_test sets for us. We will train our model on the training set and then use the test set to evaluate the model.
GaussianNB()
Model Evaluation
Now that we’ve finished training, we can make predictions off of the test data and evaluate our model’s performance using the
corresponding test data labels (y_test).
Check accuracy score:
Model
accuracy
score:
0.8228
Compare the train set and test set accuracy:
Training
set
score:
0.8241
Test
set
score:
0.8228
The training set accuracy score is 0.8241 while the test set accuracy is 0.8228. These two values are quite comparable, so there is
no sign of overfitting.
Confussion matrix results:
Confusion
matrix
[[6299
1090]
[
641
1739]]
True
Positives(TP)
=

6299
True
Negatives(TN)
=

1739
False
Positives(FP)
=

1090
False
Negatives(FN)
=

641
Classification report is another way to evaluate the classification model performance. It displays the precision, recall, f1 and
support scores for the model. Let’s print these as well.
precision



recall

f1­score


support
0






0.91





0.85





0.88





7389
1






0.61





0.73





0.67





2380
accuracy


























0.82





9769
macro
avg






0.76





0.79





0.77





9769
weighted
avg






0.84





0.82





0.83





9769
Let’s also perform k‑Fold Cross Validation (10‑fold below). We can do this using cross_val_score(model, X_train, y_train, k,
scoring)
Cross­validation
scores:[0.82587719
0.82763158
0.82272927
0.81263712
0.83501536
0.82053532
0.82404563
0.83457657
0.81000439
0.82316806]
Average
cross­validation
score:
0.8236
interpretation:
Using the mean cross‑validation, we can conclude that we expect the model to be around 0.8236% accurate on average.
If we look at all the 10 scores produced by the 10‑fold cross‑validation, we can also conclude that there is a relatively small
variance in the accuracy between folds, so we can conclude that the model is independent of the particular folds used for
training.
Great job! You now know how to use a Naive Bayes model in sklearn. Try using this on your own dataset and refer back to this
lecture if you get stuck.
age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capita
0 39 State‑gov 77516 Bachelors 13 Never‑married Adm‑
clerical
Not‑in‑
family
White Male 2174
1 50 Self‑emp‑
not‑inc
83311 Bachelors 13 Married‑civ‑
spouse
Exec‑
managerial Husband White Male 0
2 38 Private 215646 HS‑grad 9 Divorced Handlers‑
cleaners
Not‑in‑
family
White Male 0
3 53 Private 234721 11th 7 Married‑civ‑
spouse
Handlers‑
cleaners
Husband Black Male 0
4 28 Private 338409 Bachelors 13 Married‑civ‑
spouse
Prof‑
specialty
Wife Black Female 0
age fnlwgt education_num capital_gain capital_loss hours_per_week workclass_
Local‑gov
workclass_
Never‑
worked
workclass_
Private
workclass_
Self‑emp‑
inc

0 39 77516 13 2174 0 40 0 0 0 0 …
1 50 83311 13 0 0 13 0 0 0 0 …
2 38 215646 9 0 0 40 0 0 1 0 …
3 53 234721 7 0 0 40 0 0 1 0 …
4 28 338409 13 0 0 40 0 0 1 0 …
5 rows × 98 columns
age fnlwgt education_num capital_gain capital_loss hours_per_week workclass_
Local‑gov
workclass_
Never‑
worked
workclass_
Private
workclass_
Self‑emp‑
inc
0 0.10 ‑0.845803 1.000000 2174.0 0.0 0.0 0 0 0 0
1 0.65 ‑0.797197 1.000000 0.0 0.0 ‑5.4 0 0 0 0
2 0.05 0.312773 ‑0.333333 0.0 0.0 0.0 0 0 1 0
3 0.80 0.472766 ‑1.000000 0.0 0.0 0.0 0 0 1 0
4 ‑0.45 1.342456 1.000000 0.0 0.0 0.0 0 0 1 0
5 rows × 97 columns
In
[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In
[2]:
from matplotlib import rcParams
rcParams[‘figure.figsize’] = 15, 5
sns.set_style(‘darkgrid’)
In
[3]:
adult_df = pd.read_csv(‘adult.csv’, header=None)
adult_df.columns = [‘age’, ‘workclass’, ‘fnlwgt’, ‘education’, ‘education_num’, ‘marital_status’, ‘occupatio
‘relationship’,’race’, ‘sex’, ‘capital_gain’, ‘capital_loss’, ‘hours_per_week’, ‘native_
‘income’]
adult_df.head()
Out[3]: In
[4]:
adult_df.info()
In
[5]:
categoricals = [‘workclass’, ‘education’, ‘marital_status’, ‘occupation’, ‘relationship’, ‘race’, ‘sex’,
‘native_country’, ‘income’]
numericals = [‘age’, ‘fnlwgt’, ‘education_num’, ‘capital_gain’, ‘capital_loss’, ‘hours_per_week’]
In
[6]:
adult_df[categoricals].head()
Out[6]: In
[7]: sns.countplot(x=adult_df[‘income’], hue=’sex’, data=adult_df)
plt.show()
In
[8]:
#
order=
is
an
optional
parameter,
which
is
just
sorting
the
bars
in
this
case.
sns.countplot(x=adult_df[‘education’], order=adult_df[‘education’].value_counts().index)
plt.xticks(rotation=45)
plt.show()
In
[9]: sns.countplot(x=adult_df[‘race’], data=adult_df)
plt.show()
In
[10]: sns.countplot(x=adult_df[‘occupation’], data=adult_df, order=adult_df[‘occupation’].value_counts().index)
plt.xticks(rotation=45)
plt.show()
In
[11]:
adult_df[‘workclass’] = adult_df[‘workclass’].replace(‘
?’, np.NaN)
adult_df[‘occupation’] = adult_df[‘occupation’].replace(‘
?’, np.NaN)
adult_df[‘native_country’] = adult_df[‘native_country’].replace(‘
?’, np.NaN)
In
[12]: sns.barplot(x=adult_df.columns, y=adult_df.isnull().sum().values)
plt.xticks(rotation=45)
plt.show()
In
[13]:
adult_df[numericals].head()
Out[13]: In
[14]:
for variable in numericals:
if not adult_df[adult_df[variable] == ‘
?’].empty:
print(f'{variable}
contains
missing
values
(
?)’)
In
[15]:
adult_df[numericals].hist(figsize=(20, 10))
plt.show()
In
[16]:
adult_df[‘workclass’].fillna(adult_df[‘workclass’].mode()[0], inplace=True)
adult_df[‘occupation’].fillna(adult_df[‘occupation’].mode()[0], inplace=True)
adult_df[‘native_country’].fillna(adult_df[‘native_country’].mode()[0], inplace=True)
In
[17]:
adult_df = pd.get_dummies(data=adult_df, columns=categoricals, drop_first=True)
In
[18]:
adult_df.head()
Out[18]: In
[26]:
from sklearn.preprocessing import RobustScaler
#
all
columns
except
our
target
column
for
X
X = adult_df.drop(‘income_
>50K’, axis=1)
y = adult_df[‘income_
>50K’]
#
create
our
scaler
object
scaler = RobustScaler()
#
use
our
scaler
object
to
transform/scale
our
data
and
save
it
into
X_scaled.
Only
need
to
#
transform
numerical
data.
X_scaled = scaler.fit_transform(X[numericals])
#
reassign
X[numericals]
to
the
transformed
numerical
data.
X[numericals] = X_scaled
In
[27]:
X.head()
Out[27]: In
[28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
In
[29]:
from sklearn.naive_bayes import GaussianNB
#
instantiate
the
model
to
train
a
Gaussian
Naive
Bayes
classifier
gnb = GaussianNB()
#
fit
the
model
gnb.fit(X_train, y_train)
Out[29]: In
[30]:
y_pred = gnb.predict(X_test)
In
[31]:
from sklearn.metrics import accuracy_score
print(‘Model
accuracy
score:
{0:0.4f}’.format(accuracy_score(y_test, y_pred)))
In
[32]:
y_pred_train = gnb.predict(X_train)
print(‘Training
set
score:
{:.4f}’.format(gnb.score(X_train, y_train)))
print(‘Test
set
score:
{:.4f}’.format(gnb.score(X_test, y_test)))
In
[33]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(‘Confusion
matrix\n\n’, cm)
print(‘\nTrue
Positives(TP)
=
’, cm[0,0])
print(‘\nTrue
Negatives(TN)
=
’, cm[1,1])
print(‘\nFalse
Positives(FP)
=
’, cm[0,1])
print(‘\nFalse
Negatives(FN)
=
’, cm[1,0])
In
[34]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
In
[37]:
from sklearn.model_selection import cross_val_score
#
Applying
10­Fold
Cross
Validation
scores = cross_val_score(gnb, X_train, y_train, cv = 10, scoring=’accuracy’)
print(‘Cross­validation
scores:{}’.format(scores))
#
compute
Average
cross­validation
score
print(‘\nAverage
cross­validation
score:
{:.4f}’.format(scores.mean()))
Loading [MathJax]/jax/output/CommonHTML/fonts/TeX/fontdata.js
COSC 3337