COSC 3337 Week 6 Lab solution

(Decision Trees and Random Forest)

This lab walks you through using decision trees and random forests in sklearn on your own datasets. We will also compare the two
methods. To begin, let’s quickly review some decision tree intuition, which should sound familiar if you’ve attended the
corresponding decision tree lecture.
Intuition
The Decision Tree Algorithm
The decision tree is a supervised learning algorithm that can be used to solve both regression and classification problems.
The goal of a Decision Tree is to create a model that can predict the class or value of the target variable by learning simple
decision rules inferred from the training data.
Decision trees classify the examples by sorting them down the tree from the root to some leaf/terminal node, with the leaf/terminal
node providing the classification of the example. Each node in the tree acts as a test case for some attribute, and each edge
descending from the node corresponds to the possible answers to the test case. This process is recursive in nature and is
repeated for every subtree rooted at the new node.
The primary challenge in building a decision tree is deciding which attribute to test at each node, starting with the root. To
solve this attribute selection problem, researchers have devised several attribute selection measures, including:
Entropy
Information gain
Gini index
Gain ratio
Reduction in variance
Chi-square
These criteria assign a score to every candidate attribute. The attributes are ranked by score and placed in the tree in that
order; for example, with information gain, the attribute with the highest value is placed at the root.
Note: The most popular attribute selection methods that we’ll use in this course are information gain and gini index.
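To make these measures concrete, here is a minimal sketch of entropy, Gini index, and information gain written from their standard definitions. The function names and the four-row toy data are hypothetical and chosen only for illustration; sklearn computes these quantities internally and does not expose them through these names.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(labels, attribute_values):
    """Parent entropy minus the weighted entropy of the groups
    produced by splitting on the given attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


# Hypothetical toy data: 'safety' separates the classes perfectly, 'doors' not at all.
labels = ['acc', 'acc', 'unacc', 'unacc']
safety = ['high', 'high', 'low', 'low']
doors = ['2', '4', '2', '4']

print(entropy(labels))                   # 1.0
print(gini(labels))                      # 0.5
print(information_gain(labels, safety))  # 1.0
print(information_gain(labels, doors))   # 0.0
```

On this toy data, safety would be chosen for the root because it has the higher information gain.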
Potential Problems
Overfitting is a practical problem when building a decision tree model. A tree overfits when the algorithm keeps growing deeper
and deeper to reduce the training-set error but ends up with a higher test-set error, so the model’s prediction accuracy on
unseen data goes down. This generally happens when many branches are built to fit outliers and irregularities in the data.
To avoid overfitting, we can use the following:
Pre-Pruning: Stop the tree construction a bit early. We prefer not to split a node if its goodness measure is below a threshold
value, but it is difficult to choose an appropriate stopping point.
Post-Pruning: First generate the complete decision tree, then remove non-significant branches with the aim of improving
accuracy on unseen instances.
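Both strategies can be sketched in sklearn. The example below is only an illustration under stated assumptions: it uses synthetic data from make_classification rather than the lab's dataset, and the values chosen for max_depth, min_samples_leaf, and ccp_alpha are arbitrary. Pre-pruning is approximated by limiting growth up front, and post-pruning by sklearn's cost-complexity pruning, which removes branches after the tree has been grown.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration, not the lab's dataset).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fully grown tree: no limit on depth or leaf size.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning: stop construction early with growth limits (arbitrary values).
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X_train, y_train)

# Post-pruning: grow the tree, then prune branches via cost-complexity pruning.
# ccp_alpha=0.01 is an arbitrary choice; larger values prune more aggressively.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

for name, model in [('full', full_tree), ('pre-pruned', pre_pruned), ('post-pruned', post_pruned)]:
    print(name,
          'nodes:', model.tree_.node_count,
          'train:', round(model.score(X_train, y_train), 4),
          'test:', round(model.score(X_test, y_test), 4))
```

Comparing the train and test scores of the three trees gives a rough sense of how much each pruning strategy narrows the gap between them.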
About The Data
We’ll be using the Car Evaluation Data Set from the UCI Machine Learning Repository for this lab, but feel free to follow along with
your own dataset. The dataset contains the following attributes:
buying (vhigh, high, med, low)
maint (vhigh, high, med, low)
doors (2, 3, 4, 5more)
persons (2, 4, more)
lug_boot (small, med, big)
safety (low, med, high)
class (unacc, acc, good, vgood)
Quick Exploratory Data Analysis
Let’s begin by importing some necessary libraries that we’ll be using to explore the data.
Our first step is to load the data into a pandas DataFrame. The data file does not come with a header row of column names, so
we specify header=None when loading the data and add the column names ourselves.
   buying  maint  doors  persons  lug_boot  safety  class
0   vhigh  vhigh      2        2     small     low  unacc
1   vhigh  vhigh      2        2     small     med  unacc
2   vhigh  vhigh      2        2     small    high  unacc
3   vhigh  vhigh      2        2       med     low  unacc
4   vhigh  vhigh      2        2       med     med  unacc
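As a variation on the loading step, the column names can also be supplied in the same call through the names parameter of pd.read_csv. The sketch below assumes the file is comma-separated and uses an in-memory copy of the first three rows shown above instead of the real file; dtype=str is an added assumption so that doors and persons stay strings, as they do in the full dataset.

```python
from io import StringIO

import pandas as pd

# In-memory stand-in for the headerless file (first three rows shown above).
raw = StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "vhigh,vhigh,2,2,small,med,unacc\n"
    "vhigh,vhigh,2,2,small,high,unacc\n"
)

columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

# header=None tells pandas the first line is data; names supplies the labels.
sample = pd.read_csv(raw, header=None, names=columns, dtype=str)
print(sample)
```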
After checking .info() we can see that there are no missing values. Our dataset contains 1728 entries, and each of our columns
contains 1728 non‑null values.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   buying    1728 non-null   object
 1   maint     1728 non-null   object
 2   doors     1728 non-null   object
 3   persons   1728 non-null   object
 4   lug_boot  1728 non-null   object
 5   safety    1728 non-null   object
 6   class     1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB
If we create countplots of each attribute, we can see that each unique value appears about equally often within its column.
Let’s also take a look at our target/class variable:
unacc    1210
acc       384
good       69
vgood      65
Name: class, dtype: int64
The majority of our dataset consists of unacc and acc records, with very few good and vgood records.
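Because the classes are this imbalanced, it helps to know how well a model would do by always predicting the most common class. The short calculation below uses the counts shown above; the variable names are hypothetical, and the figure describes the full dataset rather than any particular train/test split.

```python
# Class counts taken from the value_counts() output above.
class_counts = {'unacc': 1210, 'acc': 384, 'good': 69, 'vgood': 65}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total

print(total)                       # 1728
print(majority_class)              # unacc
print(f"{baseline_accuracy:.4f}")  # 0.7002
```

Any accuracy score on this dataset is easier to interpret when compared against this roughly 70% baseline.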
Data Preprocessing
Let’s now prepare our data for training. Notice that all of our variables are ordinal categorical variables. When dealing with
ordinal categorical variables, you want to preserve the order when encoding them, so we can either use sklearn’s OrdinalEncoder
or manually map each unique value in each column to a number that respects that order.
Note: We avoid one hot encoding / get_dummies here because it would not preserve the order.
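The sklearn route mentioned above can be sketched with OrdinalEncoder. This is a minimal example on a hypothetical two-column toy frame, not the lab's actual preprocessing. The categories argument is passed explicitly because, left at its default, the encoder orders categories by sorting the values (here alphabetically), which would not match the intended low-to-high order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame using two of the dataset's columns.
toy = pd.DataFrame({
    'buying': ['vhigh', 'low', 'med', 'high'],
    'safety': ['low', 'high', 'med', 'low'],
})

# One ordered category list per column, from lowest to highest.
encoder = OrdinalEncoder(categories=[
    ['low', 'med', 'high', 'vhigh'],  # buying
    ['low', 'med', 'high'],           # safety
])

encoded = encoder.fit_transform(toy[['buying', 'safety']])
print(encoded)
```

Each value is replaced by its position in the corresponding list, so vhigh becomes 3.0 and low becomes 0.0 (the encoder returns floats by default).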
   buying  maint  doors  persons  lug_boot  safety  class
0   vhigh  vhigh      2        2     small     low  unacc
1   vhigh  vhigh      2        2     small     med  unacc
2   vhigh  vhigh      2        2     small    high  unacc
3   vhigh  vhigh      2        2       med     low  unacc
4   vhigh  vhigh      2        2       med     med  unacc
One way to do our encoding is by first creating mappings that preserve the order in each column. For example, in the buying
column the low category maps to 0 and the vhigh category maps to 3.
We can then use pandas .map(dictionary) to apply our mappings to the necessary columns.
Displaying our DataFrame again, we can see that it’s now ready for training.
   buying  maint  doors  persons  lug_boot  safety  class
0       3      3      2        2         0       0      0
1       3      3      2        2         0       1      0
2       3      3      2        2         0       2      0
3       3      3      2        2         1       0      0
4       3      3      2        2         1       1      0
Creating Our Tree Models
We’re now ready to begin creating and training our model. We first need to split our data into training and testing sets. This can
be done using sklearn’s train_test_split(X, y, test_size) function. This function takes in your features (X), the target variable
(y), and the test_size you’d like (generally a test size of around 0.3 is good enough). It then returns the X_train, X_test,
y_train, and y_test sets for us. We will train our model on the training set and then use the test set to evaluate the model.
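Two optional arguments of train_test_split are worth knowing about: random_state makes the split reproducible, and stratify keeps the class proportions the same in both sets, which can matter when some classes are rare. The sketch below uses a hypothetical 100-row toy array with a 70/30 class balance; it is not the lab's split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, two features, 70/30 class balance.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 70 + [1] * 30)

# random_state fixes the shuffle; stratify=y preserves the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```

Without random_state, each run produces a different split, so accuracy scores will vary slightly from run to run.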
We’ll now import sklearn’s DecisionTreeClassifier model and begin training it using the fit(train_data, train_data_labels) method.
In a nutshell, fitting is equal to training. Then, after it is trained, the model can be used to make predictions, usually with a
predict(test_data) method call.
DecisionTreeClassifier(max_depth=3, random_state=0)
Model Evaluation
Now that we’ve finished training, we can make predictions off of the test data and evaluate our model’s performance using the
corresponding test data labels (y_test).
We’ll import sklearn’s accuracy_score to evaluate our model. It takes the true values and the predictions as input.
Model accuracy score with criterion gini index: 0.7803
Let’s also compare the train‑set and test‑set accuracy and check for overfitting.
Training set score: 0.7965
Test set score: 0.7803
Here, the training‑set accuracy score is 0.7965 while the test‑set accuracy is 0.7803. These two values are quite comparable, so
there is no sign of overfitting.
Visualize decision‑trees
Note: Try running the 3 lines of code that are commented out. If the arrows don’t appear then you’ll have to run the uncommented
code to manually fix the arrows. This is a jupyter notebook issue some people face when using sklearn’s tree visualizer.
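If the plot is hard to read, a text rendering of the tree is another option. The sketch below uses sklearn's export_text on a small tree fitted to synthetic data from make_classification; the lab's feature names are borrowed only to show how labels appear in the output, so the printed thresholds have nothing to do with the car dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with six features (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)

# Feature names borrowed from the lab purely as labels for the synthetic columns.
feature_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text returns the decision rules as an indented string.
rules = export_text(clf, feature_names=feature_names)
print(rules)
```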
Awesome! As a bonus exercise, try creating a Decision Tree Classifier with criterion entropy instead.
Random Forests
Now let’s compare the decision tree model to a random forest. This is fairly quick to do using sklearn.
RandomForestClassifier()
Model accuracy score: 0.9653
Much stronger performance! Why do you think the random forest performed better? Refer back to lecture powerpoints if you’re
unsure.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
from matplotlib import rcParams
rcParams['figure.figsize'] = 15, 5
sns.set_style('darkgrid')
In [3]:
car_data = pd.read_csv('car_evaluation.csv', header=None)
car_data.columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
car_data.head()
In [4]:
car_data.info()
In [5]:
fig, axes = plt.subplots(nrows=3, ncols=2, sharey=True, figsize=(14, 10))
sns.countplot(x=car_data['buying'], ax=axes[0][0])
sns.countplot(x=car_data['maint'], ax=axes[0][1])
sns.countplot(x=car_data['doors'], ax=axes[1][0])
sns.countplot(x=car_data['persons'], ax=axes[1][1])
sns.countplot(x=car_data['lug_boot'], ax=axes[2][0])
sns.countplot(x=car_data['safety'], ax=axes[2][1])
plt.show()
In [6]:
sns.countplot(x=car_data['class'])
plt.show()
car_data['class'].value_counts()
In [7]:
car_data.head()
In [8]:
buying_mappings = {'low': 0, 'med': 1, 'high': 2, 'vhigh': 3}
maint_mappings = {'low': 0, 'med': 1, 'high': 2, 'vhigh': 3}
door_mappings = {'2': 2, '3': 3, '4': 4, '5more': 5}
persons_mappings = {'2': 2, '4': 4, 'more': 5}
lug_boot_mappings = {'small': 0, 'med': 1, 'big': 2}
safety_mappings = {'low': 0, 'med': 1, 'high': 2}
class_mappings = {'unacc': 0, 'acc': 1, 'good': 2, 'vgood': 3}
In [9]:
car_data['buying'] = car_data['buying'].map(buying_mappings)
car_data['maint'] = car_data['maint'].map(maint_mappings)
car_data['doors'] = car_data['doors'].map(door_mappings)
car_data['persons'] = car_data['persons'].map(persons_mappings)
car_data['lug_boot'] = car_data['lug_boot'].map(lug_boot_mappings)
car_data['safety'] = car_data['safety'].map(safety_mappings)
car_data['class'] = car_data['class'].map(class_mappings)
In [10]:
car_data.head()
In [11]:
from sklearn.model_selection import train_test_split
X = car_data[['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']]
y = car_data['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
In [12]:
from sklearn.tree import DecisionTreeClassifier
# instantiate the DecisionTreeClassifier model with criterion gini index
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
# fit the model
clf_gini.fit(X_train, y_train)
In [13]:
# predict the test set results with criterion gini index
y_pred_gini = clf_gini.predict(X_test)
In [14]:
from sklearn.metrics import accuracy_score
print('Model accuracy score with criterion gini index: {0:0.4f}'.format(accuracy_score(y_test, y_pred_gini)))
In [15]:
y_pred_train_gini = clf_gini.predict(X_train)
print('Training set score: {:.4f}'.format(clf_gini.score(X_train, y_train)))
print('Test set score: {:.4f}'.format(clf_gini.score(X_test, y_test)))
In [23]:
# from sklearn import tree
# tree.plot_tree(clf_gini)
# plt.show()
from sklearn import tree  # needed below if the commented lines above were not run
fig, ax = plt.subplots(figsize=(10, 10))
out = tree.plot_tree(clf_gini, filled=True, rounded=True,
                     feature_names=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])
# make the arrows visible by giving them a black edge and a thicker line
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(3)
In [24]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
In [28]:
rfc_pred = rfc.predict(X_test)
print('Model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, rfc_pred)))
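One thing to keep in mind when reading the comparison above: the decision tree was limited to max_depth=3, while the random forest used sklearn's default settings, which place no limit on depth. Part of the gap may therefore come from the depth limit rather than from ensembling alone. The sketch below separates the two effects on synthetic data from make_classification (an assumption for illustration; the scores it prints say nothing about the car dataset).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the lab's dataset).
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    'tree, max_depth=3': DecisionTreeClassifier(max_depth=3, random_state=0),
    'tree, no depth limit': DecisionTreeClassifier(random_state=0),
    'random forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the same split and record its test-set accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.4f}")
```

Running the same three-way comparison on the car data would show how much of the improvement is due to depth and how much is due to averaging many trees.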
Great job! You now know how to use decision trees and random forests in sklearn. Try using decision trees and/or random forests
on your own dataset and refer back to this lab if you get stuck.