Description
About The Data
We’ll be using the Breast Cancer Wisconsin (Diagnostic) Data Set from Kaggle for this lab, but feel free to follow along with your
own dataset. The dataset contains a total of 32 columns, with the following attribute information:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3‑32)
Ten real‑valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray‑scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area ‑ 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension (“coastline approximation” ‑ 1)
The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each
image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
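The 30 feature names in the CSV follow this feature-by-statistic grid. As a quick sanity check, the full set of column names can be generated with a short comprehension (the list below is built from the feature descriptions above, and matches the columns we'll see from .info() later):

```python
# The ten per-nucleus features and the three summary statistics.
features = ['radius', 'texture', 'perimeter', 'area', 'smoothness',
            'compactness', 'concavity', 'concave points', 'symmetry',
            'fractal_dimension']
stats = ['mean', 'se', 'worst']

# Each (statistic, feature) pair becomes one column, e.g. 'radius_mean'.
columns = [f'{feature}_{stat}' for stat in stats for feature in features]

print(len(columns))   # 30
print(columns[0])     # radius_mean
print(columns[-1])    # fractal_dimension_worst
```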
Our goal will be to predict the diagnosis (benign or malignant).
Exploratory Data Analysis
Let’s begin by importing some necessary libraries that we’ll be using to explore the data.
Our first step is to load the data into a pandas DataFrame.
There’s an odd column “Unnamed: 32”, which we’ll go ahead and drop since it’s full of NaN values. We also won’t need the id
label, so we can drop that as well.
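In code, the load-and-drop step looks something like the following. This is a minimal sketch: the tiny two-row stand-in frame replaces pd.read_csv('data.csv') so the snippet is self-contained (the real Kaggle file has 569 rows and 33 columns).

```python
import numpy as np
import pandas as pd

# Stand-in for breast_cancer_df = pd.read_csv('data.csv'); the real file
# also contains the all-NaN 'Unnamed: 32' column and an 'id' column.
breast_cancer_df = pd.DataFrame({
    'id': [842302, 842517],
    'diagnosis': ['M', 'M'],
    'radius_mean': [17.99, 20.57],
    'Unnamed: 32': [np.nan, np.nan],
})

# 'Unnamed: 32' is entirely NaN and 'id' carries no predictive signal,
# so both can be dropped in place.
breast_cancer_df.drop(labels=['Unnamed: 32', 'id'], axis=1, inplace=True)

print(breast_cancer_df.columns.tolist())  # ['diagnosis', 'radius_mean']
```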
Since a lot of the features in this dataset can be hard to interpret without domain knowledge of cancer or tumor cells, we’ll just do
a few visualizations here, but feel free to explore as much as you’d like before constructing a model.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   diagnosis                569 non-null    object
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  569 non-null    float64
 15  smoothness_se            569 non-null    float64
 16  compactness_se           569 non-null    float64
 17  concavity_se             569 non-null    float64
 18  concave points_se        569 non-null    float64
 19  symmetry_se              569 non-null    float64
 20  fractal_dimension_se     569 non-null    float64
 21  radius_worst             569 non-null    float64
 22  texture_worst            569 non-null    float64
 23  perimeter_worst          569 non-null    float64
 24  area_worst               569 non-null    float64
 25  smoothness_worst         569 non-null    float64
 26  compactness_worst        569 non-null    float64
 27  concavity_worst          569 non-null    float64
 28  concave points_worst     569 non-null    float64
 29  symmetry_worst           569 non-null    float64
 30  fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), object(1)
memory usage: 137.9+ KB
Calling .info(), we see that there are no missing values in this dataset.
There seems to be pretty good distinction between the diagnoses (blue & orange) in most of the attributes above.
The majority of our observations belong to the benign class.
area_mean could be a good predictor of whether a tumor is malignant or benign, since there is pretty good separation here. Most
benign (orange) cases have an area_mean of around 500 or lower.
Some strong correlations are present (the very bright squares, for example).
Pre‑Processing
Let’s go ahead and scale our data before training and creating our model.
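A minimal scaling sketch: StandardScaler standardizes each column to zero mean and unit variance. The toy values below are assumptions used only so the snippet runs on its own.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy feature frame standing in for the real feature columns.
X = pd.DataFrame({'area_mean': [1001.0, 1326.0, 386.1],
                  'radius_mean': [17.99, 20.57, 11.42]})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # returns a NumPy array

# Wrap back into a DataFrame so the column names survive scaling.
X = pd.DataFrame(data=X_scaled, columns=X.columns)

# Each column now has (approximately) zero mean and unit variance.
print(X.describe())
```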
Creating our Model
We’re now ready to begin creating and training our model. We first need to split our data into training and testing sets. This can be
done using sklearn’s train_test_split(X, y, test_size) function. This function takes in your features (X), the target variable (y), and
the test_size you’d like (generally, a test size of around 0.3 is good enough). It then returns a tuple of X_train, X_test, y_train,
y_test sets for us. We will train our model on the training set and then use the test set to evaluate the model.
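A sketch of the split, using sklearn's built-in copy of this dataset as a stand-in for the scaled X and y built from the Kaggle CSV (random_state is an added assumption so the split is reproducible):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# 569 samples, 30 features -- same data, shipped with sklearn.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

print(X_train.shape, X_test.shape)  # (398, 30) (171, 30)
```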
SVC()
Model Evaluation
Now that we’ve finished training, we can make predictions off of the test data and evaluate our model’s performance using the
corresponding test data.
Confusion matrix

 [[104   1]
 [  3  63]]

True Positives(TP) =  104
True Negatives(TN) =  63
False Positives(FP) =  1
False Negatives(FN) =  3
              precision    recall  f1-score   support

           B       0.97      0.99      0.98       105
           M       0.98      0.95      0.97        66

    accuracy                           0.98       171
   macro avg       0.98      0.97      0.98       171
weighted avg       0.98      0.98      0.98       171
Hyperparameter Tuning
Finding the right parameters (like which C or gamma values to use) is a tricky task, but luckily we can be a little lazy and just try a
bunch of combinations and see what works best. This idea of creating a ‘grid’ of parameters and trying out all the possible
combinations is called a grid search. This method is common enough that Scikit‑learn has the functionality built in with
GridSearchCV. The CV stands for cross‑validation.
GridSearchCV takes a dictionary that describes the parameters that should be tried and a model to train. The grid of parameters is
defined as a dictionary, where the keys are the parameters and the values are the settings to be tested. Let’s go ahead and try a
few different parameters to see which of them is the best set to use.
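A sketch of the grid definition and search, again on the stand-in copy of the dataset (cv=5 folds is the GridSearchCV default; verbose is turned down here to keep the output short):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

# Keys are SVC parameter names; values are the settings to try.
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

# refit=True retrains the best combination on all of the training data;
# verbose controls how much progress text is printed.
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=1)
grid.fit(X_train, y_train)

print(grid.best_params_)
```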
You should add refit=True and set verbose to whatever number you want; the higher the number, the more verbose the output
(verbose just controls the amount of text printed to describe the process).
What fit does is a bit more involved than usual. First, it runs the same loop with cross‑validation to find the best parameter
combination. Once it has the best combination, it runs fit again on all data passed to fit (without cross‑validation), to build a single
new model using the best parameter setting.
Note: This process may take a while. The more parameters we test, the longer it takes, since every combination has to be tried in
order to find the best set.
Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.637, total=   0.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.637, total=   0.0s
...
[CV] C=10, gamma=0.01, kernel=rbf ....................................
[CV] ........ C=10, gamma=0.01, kernel=rbf, score=0.963, total=   0.0s
[CV] C=10, gamma=0.01, kernel=rbf ....................................
[CV] ........ C=10, gamma=0.01, kernel=rbf, score=1.000, total=   0.0s
...
[CV] C=1000, gamma=0.0001, kernel=rbf ................................
[CV] .... C=1000, gamma=0.0001, kernel=rbf, score=0.987, total=   0.0s
[Parallel(n_jobs=1)]: Done 125 out of 125 | elapsed:    1.0s finished
(verbose log truncated; two lines like the above are printed for each of the 125 fits)
GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.1, 1, 10, 100, 1000],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                         'kernel': ['rbf']},
             verbose=3)
You can inspect the best parameters found by GridSearchCV using the best_params_ attribute, and the best estimator using the
best_estimator_ attribute. Here we see that the best set of parameters from the ones we specified is 10 for the C value, 0.01 for
gamma, and ‘rbf’ for the kernel.
{'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
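Accessing these attributes looks like the following sketch. A deliberately tiny grid is used here (an assumption, just so the example fits quickly on the stand-in copy of the dataset):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Two candidate C values only, to keep the search fast.
grid = GridSearchCV(SVC(), {'C': [1, 10], 'gamma': [0.01], 'kernel': ['rbf']})
grid.fit(X, y)

print(grid.best_params_)     # e.g. {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
print(grid.best_estimator_)  # the SVC refit with those parameters
```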
Then you can re‑run predictions on this grid object just like you would with a normal model.
[[105   0]
 [  2  64]]
              precision    recall  f1-score   support

           B       0.98      1.00      0.99       105
           M       1.00      0.97      0.98        66

    accuracy                           0.99       171
   macro avg       0.99      0.98      0.99       171
weighted avg       0.99      0.99      0.99       171
Nice! We got a slight improvement using these parameters, though our original accuracy was already very good. Keep grid
search in mind when you need to do hyperparameter tuning; it can save you a lot of time.
Congrats! You now know how to use SVM and hyperparameter tuning in sklearn. Try using this on your own dataset and refer
back to this lecture if you get stuck.
   id        diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  compactness_mean  concavi
0  842302    M          17.99        10.38         122.80          1001.0     0.11840          0.27760
1  842517    M          20.57        17.77         132.90          1326.0     0.08474          0.07864
2  84300903  M          19.69        21.25         130.00          1203.0     0.10960          0.15990
3  84348301  M          11.42        20.38         77.58           386.1      0.14250          0.28390
4  84358402  M          20.29        14.34         135.10          1297.0     0.10030          0.13280

5 rows × 33 columns
   radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  compactness_mean  concavity_mean  concave points_mean
0  1.097064     -2.073335     1.269934        0.984375   1.568466         3.283515          2.652874        2.532475
1  1.829821     -0.353632     1.685955        1.908708   -0.826962        -0.487072         -0.023846       0.548144
2  1.579888     0.456187      1.566503        1.558884   0.942210         1.052926          1.363478        2.037231
3  -0.768909    0.253732      -0.592687       -0.764464  3.283553         3.402909          1.915897        1.451707
4  1.750297     -1.151816     1.776573        1.826229   0.280372         0.539340          1.371011        1.428493

5 rows × 30 columns
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from matplotlib import rcParams
rcParams['figure.figsize'] = 15, 5
sns.set_style('darkgrid')

In [3]:
breast_cancer_df = pd.read_csv('data.csv')
breast_cancer_df.head()

In [4]:
breast_cancer_df.drop(labels=['Unnamed: 32', 'id'], axis=1, inplace=True)

In [5]:
breast_cancer_df.info()

In [6]:
sns.pairplot(breast_cancer_df, hue='diagnosis',
             vars=['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
                   'smoothness_mean', 'compactness_mean', 'concavity_mean',
                   'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean'])
plt.show()

In [7]:
sns.countplot(x=breast_cancer_df['diagnosis'])
plt.show()

In [8]:
sns.scatterplot(x='area_mean', y='smoothness_mean', hue='diagnosis', data=breast_cancer_df)
plt.show()

In [9]:
plt.figure(figsize=(20, 10))
sns.heatmap(breast_cancer_df.corr(), annot=True)
plt.show()
In [10]:
from sklearn.preprocessing import StandardScaler

# all columns except 'diagnosis'
X = breast_cancer_df.drop('diagnosis', axis=1)
y = breast_cancer_df['diagnosis']

# create our scaler object
scaler = StandardScaler()

# use our scaler object to transform/scale our data and save it into X_scaled
X_scaled = scaler.fit_transform(X)

# reassign X to a new DataFrame using the X_scaled values
X = pd.DataFrame(data=X_scaled, columns=X.columns)
In [11]:
X.head()

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [13]:
from sklearn.svm import SVC

# instantiate the model with default parameters
model = SVC()

# fit/train
model.fit(X_train, y_train)

In [14]:
predictions = model.predict(X_test)

In [15]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, predictions)
print('Confusion matrix\n\n', cm)
print('\nTrue Positives(TP) = ', cm[0,0])
print('\nTrue Negatives(TN) = ', cm[1,1])
print('\nFalse Positives(FP) = ', cm[0,1])
print('\nFalse Negatives(FN) = ', cm[1,0])
In [16]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

In [17]:
param_grid = {'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 'kernel': ['rbf']}

In [19]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)

In [20]:
grid.fit(X_train, y_train)

In [21]:
grid.best_params_

In [23]:
grid_predictions = grid.predict(X_test)

In [24]:
print(confusion_matrix(y_test, grid_predictions))
print(classification_report(y_test, grid_predictions))
COSC 3337