Name: COSC 3337 Week 7 Lab (KNN) and (Naive Bayes) solved
SKU: 24901
Price: 30.00 USD
Availability: InStock

Description

5/5 - (3 votes)

About The Data
In this lab you will learn how to use sklearn to build a machine learning model using k‑Nearest Neighbors algorithm to predict
whether the patients in the “Pima Indians Diabetes Dataset” have diabetes or not.
The dataset that we’ll be using for this task comes from kaggle.com and contains the following attributes:
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2‑Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age (in years)
Outcome: Class variable (0 or 1)
Exploratory Data Analysis
Let’s begin by importing some necessary libraries that we’ll be using to explore the data.
Our first step is to load the data into a pandas DataFrame
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
From here, it’s always a good step to use describe() and info() to get a better sense of the data and see if we have any missing
values.
<class ’pandas.core.frame.DataFrame’>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
#   Column                    NonNull Count  Dtype
                        
0   Pregnancies               768 nonnull    int64
1   Glucose                   768 nonnull    int64
2   BloodPressure             768 nonnull    int64
3   SkinThickness             768 nonnull    int64
4   Insulin                   768 nonnull    int64
5   BMI                       768 nonnull    float64
6   DiabetesPedigreeFunction  768 nonnull    float64
7   Age                       768 nonnull    int64
8   Outcome                   768 nonnull    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
Looking at the info summary, we can see that there are 768 entries in the DataFrame, and 768 non‑null entries in each
feature/colum. Thus, there are no missing values, but there is something strange when we look at the describe summary below.
For certain columns below, does a value of zero make sense? For example, if an individual had a glucose or blood pressure level of
0, they’d probably be dead, so it’s likely that the true values were excluded from the data for some reason.
Therefore, we’ll consider the following columns to have missing values where there’s an invalid zero value:
Glucose
BloodPressure
SkinThickness
Insulin
BMI
Let’s go ahead and replace out invalid zero values with nan, since they technically missing values. We’ll go ahead and make a copy
of our diabetes_df and modify the zeros in the copy just incase we need to refer back to the original. We can make copies of
DataFrames using .copy(deep=True). There’s also a very convenient function we can call .replace(x, y) that will replace all x
values with the y value specified.
Before choosing how to impute these missing values, let’s take a look at their distributions.
Since SkinThickness, Insulin, and BMI look skewed, we’ll go ahead and replace their missing values with median instead of mean.
Glucose and BloodPressure should be ok if we stick with mean for imputing. Recall that mean can be effected by outliers.
Let’s first create a heatmap and see if there are any correlations in our dataset.
Interpretation: No significant case of multi collinearity is observed.
Let’s also check out a few scatterplots of our data.
Interpretation:
BMI seems to have a slight increase as blood pressure increases. However, majority of the data seems to be centered and
cluster at around a blood pressure of 50‑95 and BMI of 20‑45. We’ve also got some outliars scattered around the main
cluster.
There’s a very subtle increase in diabetes pedigree function as glucose increases. Majority of the data tends to fall between a
75 and 175 glucose level. We also have some outliars with very high diabetes pedigree function and again the zeros outliars
which were removed in the no_zeros_df
Note: Don’t worry if you can’t replicate the plot to the right. You should have learned about QQ plots in math 3339. In case anyone
needs these type of plots or a certain statistical test with p‑values for their project, statsmodels is a great place to find these.
ShapiroWilk:
w:0.969902515411377, pvalue:1.7774986343921384e11
KolmogorovSmirnov:
d:0.969902515411377, pvalue:0.0
Skewness of the data:
0.531677628850459
Interpretation:
The distribution of glucose is unimodal, and appears to be roughly bell shaped, but it’s certainly not a near perfect normal
distribution. The provided Q‑Q plot, Shapiro‑Wilk, and Kolmogorov‑Smirnov tests seem to reject the null hypothesis of the
data being a normal distribution at the .05 significance level. We can also see both by the graph and provided skewness score
(should be about zero for normally distributed data) below that the data has a slight right skew. The distribution peaks at
around 120 with most of the data between 100 and 140.
How does the glucose distribution of people with diabetes vary from those without?
Interpretation:
Majority of people in class 0 lie between 93 and 125, where as majority of people in class 1 lie between 119 and 167. With that
said, this attribute could serve as a good indicator to determine whether somone is diabetic or not since those in class 1 tend
to be in the higher end compared to class 0.
I encourage you to go ahead and explore the dataset some more to see if you can find some more interesting points, but I’ll jump
to the pre‑processing now since the main goal of this lab is KNN.
Pre‑Processing
The most important step here is to standardize our data. Because the KNN classifier predicts the class of a given test observation
by identifying the observations that are nearest to it, the scale of the variables matters. If this is not taken into account, any
variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN
classifier, than variables that are on a small scale.
If you recall from math 3339, data Z is rescaled such that and , and is done through this formula:
But lucky for us sklearn can do all of this for us.
Taking a look at the data again, we see that it is now scaled.
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 0.639947 0.865108 ‑0.033518 0.670643 ‑0.181541 0.166619 0.468492 1.425995
1 ‑0.844885 ‑1.206162 ‑0.529859 ‑0.012301 ‑0.181541 ‑0.852200 ‑0.365061 ‑0.190672
2 1.233880 2.015813 ‑0.695306 ‑0.012301 ‑0.181541 ‑1.332500 0.604397 ‑0.105584
3 ‑0.844885 ‑1.074652 ‑0.529859 ‑0.695245 ‑0.540642 ‑0.633881 ‑0.920763 ‑1.041549
4 ‑1.141852 0.503458 ‑2.680669 0.670643 0.316566 1.549303 5.484909 ‑0.020496
Creating our Model
We’re now ready to begin creating and training our model. We first need to split our data into training and testing sets. This can be
done using sklearn’s train_test_split(X, y, test_size) function. This function takes in your features (X), the target variable (y), and
the test_size you’d like (Generally a test size of around 0.3 is good enough). It will then return a tuple of X_train, X_test, y_train,
y_test sets for us. We will train our model on the training set and then use the test set to evaluate the model.
The above graph shows that the data is biased towards datapoints having outcome value as 0 (diabetes was not present actually).
The number of non‑diabetics is almost twice the number of diabetic patients. This is where an additional parameter stratify can
come in handy. Stratified sampling aims at splitting a data set so that each split is similar with respect to something. In a
classification setting, it is often chosen to ensure that the train and test sets have approximately the same percentage of samples
of each target class as the complete set.
Recall from lecture, KNN requires us to find some optimal k value. How we’ll do this is by plotting different k values on the x axis,
and the model score for that k value on the y‑axis.
Note: You can also plot the error on the y‑axis, which is quite common as well.
The best result seems to be captured at k = 11 thus 11 will be used for the final model. At this value our train and test scores don’t
vary significantly.
0.7532467532467533
Note: You should also take into account cross validation when considering different models. A separate exercise however
will be created covering different cross validation techniques.
Not bad, but could be better. See if you can mess with the data and imporve on this score.
Lastly, let’s just print out a confusion matrix and classification report of our results.
precision    recall  f1score   support
0       0.79      0.85      0.82       150
1       0.67      0.58      0.62        81
accuracy                           0.75       231
macro avg       0.73      0.71      0.72       231
weighted avg       0.75      0.75      0.75       231
[[127  23]
[ 34  47]]
Great job! You now know how to use KNeighborsClassifier in sklearn. Try using this on your own dataset and refer back to this
lecture if you get stuck.
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 76
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
from matplotlib import rcParams
rcParams[‘figure.figsize’] = 15, 5
sns.set_style(‘darkgrid’)
In [3]:
diabetes_df = pd.read_csv(‘diabetes.csv’)
diabetes_df.head()
Out[3]: In [4]:
diabetes_df.info()
In [5]:
diabetes_df.describe()
Out[5]: In [6]:
diabetes_df_copy = diabetes_df.copy(deep=True)
diabetes_df_copy[‘Glucose’] = diabetes_df_copy[‘Glucose’].replace(0,np.NaN)
diabetes_df_copy[‘BloodPressure’] = diabetes_df_copy[‘BloodPressure’].replace(0,np.NaN)
diabetes_df_copy[‘SkinThickness’] = diabetes_df_copy[‘SkinThickness’].replace(0,np.NaN)
diabetes_df_copy[‘Insulin’] = diabetes_df_copy[‘Insulin’].replace(0,np.NaN)
diabetes_df_copy[‘BMI’] = diabetes_df_copy[‘BMI’].replace(0,np.NaN)
In [7]:
diabetes_df_copy[[‘Glucose’, ‘BloodPressure’, ‘SkinThickness’, ‘Insulin’, ‘BMI’]].hist(figsize = (20,10))
plt.show()
In [8]:
diabetes_df_copy[‘Glucose’].fillna(diabetes_df_copy[‘Glucose’].mean(), inplace = True)
diabetes_df_copy[‘BloodPressure’].fillna(diabetes_df_copy[‘BloodPressure’].mean(), inplace = True)
diabetes_df_copy[‘SkinThickness’].fillna(diabetes_df_copy[‘SkinThickness’].median(), inplace = True)
diabetes_df_copy[‘Insulin’].fillna(diabetes_df_copy[‘Insulin’].median(), inplace = True)
diabetes_df_copy[‘BMI’].fillna(diabetes_df_copy[‘BMI’].median(), inplace = True)
In [9]: sns.heatmap(diabetes_df_copy.corr(), annot=True)
plt.title(‘Correlation Matrix’)
plt.show()
In [10]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
# alpha parameter adjusts the point transparency. points with much more overlap will appear darker.
sns.scatterplot(x=’BloodPressure’, y=’BMI’, data=diabetes_df_copy, alpha=0.3, ax=axes[0])
axes[0].set_title(‘BloodPressure VS. BMI’)
sns.scatterplot(x=’Glucose’, y=’DiabetesPedigreeFunction’, data=diabetes_df_copy, alpha=0.3, ax=axes[1])
axes[1].set_title(‘Glucose VS. DPF’)
plt.show()
In [11]:
import statsmodels.api as sm
import scipy
import pylab
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
sns.histplot(diabetes_df_copy[‘Glucose’], ax=axes[0])
axes[0].set_title(‘Glucose Distribution’)
sm.qqplot(diabetes_df_copy[‘Glucose’], line=’s’, ax=axes[1])
axes[1].set_title(‘Glucose QQ Plot’)
pylab.show()
w, p_val = scipy.stats.shapiro(diabetes_df_copy[‘Glucose’])
print(‘ShapiroWilk: \nw:{}, pvalue:{}\n’.format(w,p_val))
d, p_val = scipy.stats.kstest(diabetes_df_copy[‘Glucose’], ‘norm’)
print(‘KolmogorovSmirnov: \nd:{}, pvalue:{}\n’.format(w,p_val))
print(‘Skewness of the data: \n{}\n’.format(scipy.stats.skew(diabetes_df_copy[‘Glucose’])))
In [12]:
class_zero = diabetes_df_copy[(diabetes_df_copy[‘Outcome’] == 0)]
class_one = diabetes_df_copy[(diabetes_df_copy[‘Outcome’] == 1)]
plt.hist(x=class_zero[‘Glucose’], label=’class 0′, alpha=0.5)
plt.hist(x=class_one[‘Glucose’], label=’class 1′, alpha=0.5)
plt.legend()
plt.title(‘Glucose Distribution’)
plt.show()
In [13]:
from sklearn.preprocessing import StandardScaler
# all columns except ’Outcome’
X = diabetes_df_copy.drop(‘Outcome’, axis=1)
y = diabetes_df_copy[‘Outcome’]
# create our scaler object
scaler = StandardScaler()
# use our scaler object to transform/scale our data and save it into X_scaled
X_scaled = scaler.fit_transform(X)
# reassign X to a new DataFrame using the X_scaled values.
X = pd.DataFrame(data=X_scaled, columns=X.columns)
In [14]:
X.head()
Out[14]: In [15]: sns.countplot(x=diabetes_df_copy[‘Outcome’])
plt.show()
In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
In [17]:
from sklearn.neighbors import KNeighborsClassifier
# will append scores here for plotting later
test_scores = []
train_scores = []
# testing k values from 114
for i in range(1,15):
# create a model with k=i
knn = KNeighborsClassifier(i)
# train the model
knn.fit(X_train,y_train)

# append scores.
train_scores.append(knn.score(X_train,y_train))
test_scores.append(knn.score(X_test,y_test))
In [20]: sns.lineplot(x=range(1,15), y=train_scores, marker=’*’, label=’Train Score’)
sns.lineplot(x=range(1,15), y=test_scores, marker=’o’, label=’Test Score’)
plt.title(‘K vs. Score’)
plt.xlabel(‘K’)
plt.ylabel(‘Score’)
plt.show()
In [21]:
knn = KNeighborsClassifier(11)
knn.fit(X_train,y_train)
knn.score(X_test,y_test)
Out[21]: In [22]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
y_pred = knn.predict(X_test)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))

Week 7 Lab (Naive Bayes)
COSC 3337
About The Data
We’ll be using the Adult Dataset from kaggle for this lab, but feel free to follow along with your own dataset. The dataset contains
the following attributes:
age
workclass
fnlwgt
education
education_num
marital_status
occupation
relationship
race
sex
capital_gain
capital_loss
hours_per_week
native_country
income
Our goal is to predict whether income exceeds $50k/yr based on census data
Exploratory Data Analysis
Let’s begin by importing some necessary libraries that we’ll be using to explore the data.
Our first step is to load the data into a pandas DataFrame. For some reason, this dataset did not come with a header/column
names, so we will specify that when loading the data and manually add the column names ourselves.
Calling .info() we can see that there are no missing values in our dataset since there are 32561 entries in total, and 32561 non‑
null entries in every column.
<class ’pandas.core.frame.DataFrame’>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
#   Column          NonNull Count  Dtype
              
0   age             32561 nonnull  int64
1   workclass       32561 nonnull  object
2   fnlwgt          32561 nonnull  int64
3   education       32561 nonnull  object
4   education_num   32561 nonnull  int64
5   marital_status  32561 nonnull  object
6   occupation      32561 nonnull  object
7   relationship    32561 nonnull  object
8   race            32561 nonnull  object
9   sex             32561 nonnull  object
10  capital_gain    32561 nonnull  int64
11  capital_loss    32561 nonnull  int64
12  hours_per_week  32561 nonnull  int64
13  native_country  32561 nonnull  object
14  income          32561 nonnull  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
When working with a lot of variables, it’s usually a good idea to keep track of your categorical and numerical columns in a
separate array so that way we can easilly index our dataframe by that array if for some reason we only want to work with the
numerical columns. For example, when calculating correlations we only want to work with the numerical columns else we will get
an error.
Now we can easily explore just categorical or numericals at a time. Let’s begin exploring the categorical variables first.
workclass education marital_status occupation relationship race sex native_country income
0 State‑gov Bachelors Never‑married Adm‑clerical Not‑in‑family White Male United‑States <=50K
1 Self‑emp‑not‑inc Bachelors Married‑civ‑spouse Exec‑managerial Husband White Male United‑States <=50K
2 Private HS‑grad Divorced Handlers‑cleaners Not‑in‑family White Male United‑States <=50K
3 Private 11th Married‑civ‑spouse Handlers‑cleaners Husband Black Male United‑States <=50K
4 Private Bachelors Married‑civ‑spouse Prof‑specialty Wife Black Female Cuba <=50K
Does one sex tend to earn more than the other in this dataset?
Interpretation: majority of our dataset consist of people earning <=50k, but we can see that in both categories (<=50k and >50k),
majority of the men earn more.
What’s the most common education people in our dataset have?
Interpretation: high school, some college, and bachelors degrees seem to be most common in our dataset.
Let’s see how many counts of each race we have in this dataset
Interpretation: our dataset mostly consists of people from the white race category. Thus, inferences based on race from this
dataset could be biased since we do not have enough data from other race categories.
What sort of occupations do we have in our dataset, and which are most common?
Interpretation: Prof‑specialty, craft‑repair, and Exec‑managerial are the top 3 occupations in our dataset. Also, there’s a ‘?’
signifying unknown. We’ll have to make sure to replace those question marks with null/nan values since these should really be
missing values. If you take a look, you’ll see that workclass and native_country also have ‘?’ values, so we’ll replace those with
NaN as follows:
Note: There was a small space infront of the question mark, so make sure to include that if you’re using the same dataset.
After running the cell above, we can see that we have the following missing values, which we’ll have to take care of.
Let’s now briefly explore the numerical variables
age fnlwgt education_num capital_gain capital_loss hours_per_week
0 39 77516 13 2174 0 40
1 50 83311 13 0 0 13
2 38 215646 9 0 0 40
3 53 234721 7 0 0 40
4 28 338409 13 0 0 40
Let’s check if there are any ‘?’ missing values in any of the numerical columns like we had in the categoricals. We can do this by
looping through every variable in the numericals list and printing a note if that column contains a ‘ ?’.
Great, there are no missing vlaues to take care of here, we’ll just have to take care of the categorical missing values in later.
What do the distributions of our numerical variables look like?
I encourage you to go ahead and explore the dataset some more to see if you can find some more interesting points, but I’ll jump
to the pre‑processing now since you should be comfortable exploring datasets by now, and the main goal of this lab is to learn
how to create and evaluate a Naive Bayes model in sklearn.
Pre‑Processing
We’ll first take care of the missing categorical values. One option is to replace the missing values with the most frequent/mode,
which we’ll do below. However, options for dealing with missing categorical variables include:
Remove observations with missing values if we are dealing with a large dataset and the number of records containing missing
values are few.
Remove the variable/column if it is not significant.
Develop a model to predict missing values. KNN for example.
Replace missing values with the most frequent in that column.
Our next step is to encode these categories. Since our categories don’t really have any type of order to preserve, we’ll use one hot
encoding / get dummies. Refer back to lab 5 if you’re having trouble using dummy variables, but we’ll encode as follows:
Let’s now map all of our variables onto the same scale. We’ll follow the same steps as the KNN lab. The only difference from KNN
lab is that here we’re using RobustScaler, which just scales features using statistics that are robust to outliers.
Creating our Model
We’re now ready to begin creating and training our model. We first need to split our data into training and testing sets. This can be
done using sklearn’s train_test_split(X, y, test_size) function. This function takes in your features (X), the target variable (y), and
the test_size you’d like (Generally a test size of around 0.3 is good enough). It will then return a tuple of X_train, X_test, y_train,
y_test sets for us. We will train our model on the training set and then use the test set to evaluate the model.
GaussianNB()
Model Evaluation
Now that we’ve finished training, we can make predictions off of the test data and evaluate our model’s performance using the
corresponding test data labels (y_test).
Check accuracy score:
Model accuracy score: 0.8228
Compare the train set and test set accuracy:
Training set score: 0.8241
Test set score: 0.8228
The training set accuracy score is 0.8241 while the test set accuracy is 0.8228. These two values are quite comparable, so there is
no sign of overfitting.
Confussion matrix results:
Confusion matrix
[[6299 1090]
[ 641 1739]]
True Positives(TP) =  6299
True Negatives(TN) =  1739
False Positives(FP) =  1090
False Negatives(FN) =  641
Classification report is another way to evaluate the classification model performance. It displays the precision, recall, f1 and
support scores for the model. Let’s print these as well.
precision    recall  f1score   support
0       0.91      0.85      0.88      7389
1       0.61      0.73      0.67      2380
accuracy                           0.82      9769
macro avg       0.76      0.79      0.77      9769
weighted avg       0.84      0.82      0.83      9769
Let’s also perform k‑Fold Cross Validation (10‑fold below). We can do this using cross_val_score(model, X_train, y_train, k,
scoring)
Crossvalidation scores:[0.82587719 0.82763158 0.82272927 0.81263712 0.83501536 0.82053532
0.82404563 0.83457657 0.81000439 0.82316806]
Average crossvalidation score: 0.8236
interpretation:
Using the mean cross‑validation, we can conclude that we expect the model to be around 0.8236% accurate on average.
If we look at all the 10 scores produced by the 10‑fold cross‑validation, we can also conclude that there is a relatively small
variance in the accuracy between folds, so we can conclude that the model is independent of the particular folds used for
training.
Great job! You now know how to use a Naive Bayes model in sklearn. Try using this on your own dataset and refer back to this
lecture if you get stuck.
age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capita
0 39 State‑gov 77516 Bachelors 13 Never‑married Adm‑
clerical
Not‑in‑
family
White Male 2174
1 50 Self‑emp‑
not‑inc
83311 Bachelors 13 Married‑civ‑
spouse
Exec‑
managerial Husband White Male 0
2 38 Private 215646 HS‑grad 9 Divorced Handlers‑
cleaners
Not‑in‑
family
White Male 0
3 53 Private 234721 11th 7 Married‑civ‑
spouse
Handlers‑
cleaners
Husband Black Male 0
4 28 Private 338409 Bachelors 13 Married‑civ‑
spouse
Prof‑
specialty
Wife Black Female 0
age fnlwgt education_num capital_gain capital_loss hours_per_week workclass_
Local‑gov
workclass_
Never‑
worked
workclass_
Private
workclass_
Self‑emp‑
inc
…
0 39 77516 13 2174 0 40 0 0 0 0 …
1 50 83311 13 0 0 13 0 0 0 0 …
2 38 215646 9 0 0 40 0 0 1 0 …
3 53 234721 7 0 0 40 0 0 1 0 …
4 28 338409 13 0 0 40 0 0 1 0 …
5 rows × 98 columns
age fnlwgt education_num capital_gain capital_loss hours_per_week workclass_
Local‑gov
workclass_
Never‑
worked
workclass_
Private
workclass_
Self‑emp‑
inc
0 0.10 ‑0.845803 1.000000 2174.0 0.0 0.0 0 0 0 0
1 0.65 ‑0.797197 1.000000 0.0 0.0 ‑5.4 0 0 0 0
2 0.05 0.312773 ‑0.333333 0.0 0.0 0.0 0 0 1 0
3 0.80 0.472766 ‑1.000000 0.0 0.0 0.0 0 0 1 0
4 ‑0.45 1.342456 1.000000 0.0 0.0 0.0 0 0 1 0
5 rows × 97 columns
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
from matplotlib import rcParams
rcParams[‘figure.figsize’] = 15, 5
sns.set_style(‘darkgrid’)
In [3]:
adult_df = pd.read_csv(‘adult.csv’, header=None)
adult_df.columns = [‘age’, ‘workclass’, ‘fnlwgt’, ‘education’, ‘education_num’, ‘marital_status’, ‘occupatio
‘relationship’,’race’, ‘sex’, ‘capital_gain’, ‘capital_loss’, ‘hours_per_week’, ‘native_
‘income’]
adult_df.head()
Out[3]: In [4]:
adult_df.info()
In [5]:
categoricals = [‘workclass’, ‘education’, ‘marital_status’, ‘occupation’, ‘relationship’, ‘race’, ‘sex’,
‘native_country’, ‘income’]
numericals = [‘age’, ‘fnlwgt’, ‘education_num’, ‘capital_gain’, ‘capital_loss’, ‘hours_per_week’]
In [6]:
adult_df[categoricals].head()
Out[6]: In [7]: sns.countplot(x=adult_df[‘income’], hue=’sex’, data=adult_df)
plt.show()
In [8]:
# order= is an optional parameter, which is just sorting the bars in this case.
sns.countplot(x=adult_df[‘education’], order=adult_df[‘education’].value_counts().index)
plt.xticks(rotation=45)
plt.show()
In [9]: sns.countplot(x=adult_df[‘race’], data=adult_df)
plt.show()
In [10]: sns.countplot(x=adult_df[‘occupation’], data=adult_df, order=adult_df[‘occupation’].value_counts().index)
plt.xticks(rotation=45)
plt.show()
In [11]:
adult_df[‘workclass’] = adult_df[‘workclass’].replace(‘ ?’, np.NaN)
adult_df[‘occupation’] = adult_df[‘occupation’].replace(‘ ?’, np.NaN)
adult_df[‘native_country’] = adult_df[‘native_country’].replace(‘ ?’, np.NaN)
In [12]: sns.barplot(x=adult_df.columns, y=adult_df.isnull().sum().values)
plt.xticks(rotation=45)
plt.show()
In [13]:
adult_df[numericals].head()
Out[13]: In [14]:
for variable in numericals:
if not adult_df[adult_df[variable] == ‘ ?’].empty:
print(f'{variable} contains missing values ( ?)’)
In [15]:
adult_df[numericals].hist(figsize=(20, 10))
plt.show()
In [16]:
adult_df[‘workclass’].fillna(adult_df[‘workclass’].mode()[0], inplace=True)
adult_df[‘occupation’].fillna(adult_df[‘occupation’].mode()[0], inplace=True)
adult_df[‘native_country’].fillna(adult_df[‘native_country’].mode()[0], inplace=True)
In [17]:
adult_df = pd.get_dummies(data=adult_df, columns=categoricals, drop_first=True)
In [18]:
adult_df.head()
Out[18]: In [26]:
from sklearn.preprocessing import RobustScaler
# all columns except our target column for X
X = adult_df.drop(‘income_ >50K’, axis=1)
y = adult_df[‘income_ >50K’]
# create our scaler object
scaler = RobustScaler()
# use our scaler object to transform/scale our data and save it into X_scaled. Only need to
# transform numerical data.
X_scaled = scaler.fit_transform(X[numericals])
# reassign X[numericals] to the transformed numerical data.
X[numericals] = X_scaled
In [27]:
X.head()
Out[27]: In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
In [29]:
from sklearn.naive_bayes import GaussianNB
# instantiate the model to train a Gaussian Naive Bayes classifier
gnb = GaussianNB()
# fit the model
gnb.fit(X_train, y_train)
Out[29]: In [30]:
y_pred = gnb.predict(X_test)
In [31]:
from sklearn.metrics import accuracy_score
print(‘Model accuracy score: {0:0.4f}’.format(accuracy_score(y_test, y_pred)))
In [32]:
y_pred_train = gnb.predict(X_train)
print(‘Training set score: {:.4f}’.format(gnb.score(X_train, y_train)))
print(‘Test set score: {:.4f}’.format(gnb.score(X_test, y_test)))
In [33]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(‘Confusion matrix\n\n’, cm)
print(‘\nTrue Positives(TP) = ’, cm[0,0])
print(‘\nTrue Negatives(TN) = ’, cm[1,1])
print(‘\nFalse Positives(FP) = ’, cm[0,1])
print(‘\nFalse Negatives(FN) = ’, cm[1,0])
In [34]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
In [37]:
from sklearn.model_selection import cross_val_score
# Applying 10Fold Cross Validation
scores = cross_val_score(gnb, X_train, y_train, cv = 10, scoring=’accuracy’)
print(‘Crossvalidation scores:{}’.format(scores))
# compute Average crossvalidation score
print(‘\nAverage crossvalidation score: {:.4f}’.format(scores.mean()))
Loading [MathJax]/jax/output/CommonHTML/fonts/TeX/fontdata.js
COSC 3337

COSC 3337 Week 7 Lab (KNN) and (Naive Bayes) solved

Description

Related products

COSC 3337 Week 11 Lab (Hierarchical Clustering) solved

COSC 3337 Week 8 Lab (Ridge and Lasso) solved

COSC 3337 Week 2 Lab (Intro to pandas) solved