## Description

## 1. (30 points) Decision Trees (and Random Forests):

For this part of the assignment you will work on predicting forest cover type using the CoverType dataset. You can read about the dataset in detail from this link, but work only on the data provided in the download link on the course page. You have been provided with a pre-defined train, validation and test split of the dataset to work with (available for download from the course website). In this dataset, the first row gives the label of each column, with the last column being the cover type; each subsequent row represents one example. You have to implement the decision tree algorithm for predicting the cover type. You will also experiment with Random Forests in the last part of this problem.

(a) (10 points) Decision Tree Construction Construct a decision tree using the given data to predict the forest cover type. You should use mutual information as the criterion for selecting the attribute to split on. At each node, you should select the attribute which results in the maximum decrease in the entropy of the class variable (i.e., has the highest mutual information with the class variable).

This problem has some attributes as integers and some as boolean. For handling continuous attributes, you should use the following procedure. At any given internal node of the tree, a numerical attribute is considered for a two-way split by calculating the median attribute value from the data instances coming to that node, and then computing the information gain if the data were split based on whether the numerical value of the attribute is greater than the median or not. For example, if you have 10 instances coming to a node, with values of an attribute being (0, 0, 0, 1, 1, 2, 2, 2, 3, 4) in the 10 instances, then we will split on value 1 of the attribute (the median). Note that in this setting, a numerical attribute can be considered for splitting multiple times.

At any step, choose the attribute which results in the highest mutual information by splitting on its median value as described above. Plot the train, validation and test set accuracies against the number of nodes in the tree as you grow the tree. On the X-axis you should plot the number of nodes in the tree, and the Y-axis should represent the accuracy. Comment on your observations.

(b) (6 points) Decision Tree Post Pruning One of the ways to reduce overfitting in decision trees is to grow the tree fully and then use post-pruning based on a validation set. In post-pruning, we greedily prune the nodes of the tree (and the sub-tree below them) by iteratively picking a node to prune so that the resultant tree gives the maximum increase in accuracy on the validation set. In other words, among all the nodes in the tree, we prune the node such that pruning it (and the sub-tree below it) results in the maximum increase in accuracy over the validation set. This is repeated until any further pruning leads to a decrease in accuracy over the validation set. Read the following notes on pruning decision trees to avoid overfitting (also available from the course website). Post-prune the tree obtained in step (a) above using the validation set. Again, plot the training, validation and test set accuracies against the number of nodes in the tree as you successively prune the tree. Comment on your findings.
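The greedy pruning loop can be sketched as below. This is an illustrative sketch, not the required implementation: the dict-based node format (`feat`, `thr`, `majority`, `leaf`) is a hypothetical stand-in for whatever tree representation your part (a) code produces, and each internal node is assumed to store the majority class of the training instances that reached it.

```python
def predict(node, x):
    """Route example x down the tree until a leaf is reached."""
    while "leaf" not in node:
        node = node["left"] if x[node["feat"]] <= node["thr"] else node["right"]
    return node["leaf"]

def accuracy(tree, X, y):
    return sum(predict(tree, xi) == yi for xi, yi in zip(X, y)) / len(y)

def internal_nodes(node, out=None):
    """Collect (mutable) references to every internal node."""
    if out is None:
        out = []
    if "leaf" not in node:
        out.append(node)
        internal_nodes(node["left"], out)
        internal_nodes(node["right"], out)
    return out

def post_prune(tree, X_val, y_val):
    """Repeatedly replace the internal node whose pruning most improves
    validation accuracy with a majority-class leaf; stop when nothing helps."""
    while True:
        base = accuracy(tree, X_val, y_val)
        best_gain, best = 0.0, None
        for node in internal_nodes(tree):
            saved = dict(node)                              # remember fields for the undo
            node.clear(); node["leaf"] = saved["majority"]  # trial prune
            gain = accuracy(tree, X_val, y_val) - base
            node.clear(); node.update(saved)                # undo the trial
            if gain > best_gain:
                best_gain, best = gain, node
        if best is None:                                    # no strictly improving prune left
            return tree
        maj = best["majority"]
        best.clear(); best["leaf"] = maj                    # commit the best prune

# Tiny hand-built tree: the right subtree overfits, so pruning it helps.
tree = {"feat": 0, "thr": 0.5, "majority": 1,
        "left": {"leaf": 0},
        "right": {"feat": 0, "thr": 1.5, "majority": 1,
                  "left": {"leaf": 1}, "right": {"leaf": 0}}}
X_val, y_val = [[0], [1], [2]], [0, 1, 1]
post_prune(tree, X_val, y_val)
print(tree["right"], accuracy(tree, X_val, y_val))
```

Re-evaluating the full validation set for every candidate node is quadratic in tree size; caching per-node validation counts speeds this up considerably on the full CoverType tree.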

(c) (10 points) Random Forests: As discussed in class, Random Forests are extensions of decision trees, where we grow multiple decision trees in parallel on bootstrapped samples constructed from the original training data. A number of libraries are available for learning Random Forests over a given training dataset. In this particular question you will use the scikit-learn library of Python to grow a Random Forest. Click here to read the documentation and the details of the various parameter options. Try growing different forests by playing around with various parameter values. In particular, you should experiment with the following parameter values (in the given ranges): (a) n_estimators (50 to 450 in steps of 100), (b) max_features (0.1 to 1.0 in steps of 0.2), (c) min_samples_split (2 to 10 in steps of 2). You are free to try out non-default settings of other parameters too. Use the out-of-bag accuracy (as explained in class) to tune these parameters to their optimal values. You should perform a grid search over the space of parameters (read the description at the link provided for performing grid search). Report the training, out-of-bag, validation and test set accuracies for the optimal set of parameters obtained. How do your numbers, i.e., train, validation and test set accuracies, compare with those you obtained in part (b) above (obtained after pruning)?
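Since scikit-learn's `GridSearchCV` tunes by cross-validation rather than out-of-bag accuracy, one way to do the OOB-based grid search is a manual loop over `ParameterGrid`, as sketched below. The synthetic data and the trimmed-down grid are placeholders for a quick demo; substitute the provided CoverType training split and the full ranges stated above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid

# Stand-in data; use the provided CoverType training split instead.
X_train, y_train = make_classification(n_samples=300, n_features=10,
                                       n_informative=6, n_classes=3,
                                       random_state=0)

# Assignment grid: n_estimators 50..450 step 100, max_features 0.1..1.0
# step 0.2, min_samples_split 2..10 step 2. Trimmed here so the demo runs fast.
grid = ParameterGrid({
    "n_estimators": [50, 150],
    "max_features": [0.3, 0.7],       # a float means a fraction of the features
    "min_samples_split": [2, 6],
})

best_oob, best_params = -1.0, None
for params in grid:
    rf = RandomForestClassifier(oob_score=True, random_state=0, **params)
    rf.fit(X_train, y_train)
    if rf.oob_score_ > best_oob:      # tune on out-of-bag accuracy
        best_oob, best_params = rf.oob_score_, params

print(best_params, round(best_oob, 3))
```

The same fitted `best_params` forest can then be scored on the train, validation and test splits for the numbers the question asks for.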

(d) (4 points) Random Forests – Parameter Sensitivity Analysis: Once you obtain the optimal set of parameters for Random Forests (part (c) above), vary one of the parameters (in a range) while fixing the others to their optimal values. Plot the validation and test set accuracies against the varied parameter. Repeat this for each of the parameters considered above. What do you observe? How sensitive is the model to the value of each parameter? Comment.

## 2. (30 points) Neural Networks:

In this problem, you will work with the Kannada Digit Classification Dataset, available for download from the course website. You can read about the dataset here. This is an image dataset consisting of grayscale pixel values of 28×28 size images of 10 categories representing the digits 0–9 in the Kannada language. The dataset contains a train and a test set consisting of 60000 and 10000 examples, respectively. Four files representing the (x, y) pairs for train and test are given in npy format. You can load them using numpy's load method. Each row in x_train and x_test contains 784 columns representing a flattened image of size 28 × 28. The i-th row in y_train gives the digit for the i-th example row of x_train (similarly for the x_test and y_test files).

In this problem, we will use what is referred to as a one-hot encoding of the output labels. Given a total of r classes, the output is represented as an r-sized binary vector (i.e., y ∈ R^{r×1}), such that each component represents a Boolean variable, i.e., y_l ∈ {0, 1}, ∀l, 1 ≤ l ≤ r. In other words, each y vector will have exactly one entry set to 1, corresponding to the actual class label, and all others set to 0. This is one of the standard ways to represent discrete data in vector form¹. Corresponding to each output label y_l, our network will produce (independently) an output o_l, where o_l ∈ [0, 1].

(a) (10 points) Write a program to implement a generic neural network architecture to learn a model for multi-class classification using the one-hot encoding described above. You will implement the backpropagation algorithm (from first principles) to train your network. You should use the mini-batch Stochastic Gradient Descent (mini-batch SGD) algorithm to train your network. Use the Mean Squared Error (MSE) over each mini-batch as your loss function. Given a total of m examples, and M samples in each batch, the loss corresponding to batch b can be described as:

¹Feel free to read about the one-hot encoding of discrete variables online.

$$J^{b}(\theta) = \frac{1}{2M}\sum_{i=(b-1)M+1}^{bM}\;\sum_{l=1}^{r}\left(y_l^{(i)} - o_l^{(i)}\right)^2 \qquad (1)$$

Here each y^(i) is represented using the one-hot encoding described above. You will use the sigmoid as the activation function for the units in the output layer as well as in the hidden layers (we will experiment with other activation units in one of the parts below). Your implementation (including back-propagation) MUST be from first principles and not use any pre-existing Python library for the same. It should be generic enough to create an architecture based on the following input parameters:

• Mini-Batch Size (M)

• Number of features/attributes (n)

• Hidden layer architecture: a list of numbers denoting the number of perceptrons in the corresponding hidden layers. E.g., a list [100, 50] specifies two hidden layers: the first with 100 units and the second with 50 units.

• Number of target classes (r)

Assume a fully connected architecture, i.e., each unit in a hidden layer is connected to every unit in the next layer.
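As a shape-check for such an implementation, here is a minimal numpy sketch of a fully connected network with sigmoid units and the MSE loss of equation (1), trained by mini-batch SGD on a toy problem. It is a starting-point illustration under the parameterization above (hidden-layer list, M, n, r), not a complete solution; the class and method names are our own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class MLP:
    """Fully connected net: sigmoid everywhere, MSE loss as in eq. (1)."""
    def __init__(self, n, hidden, r, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [n] + list(hidden) + [r]           # e.g. hidden=[100, 50]
        self.W = [rng.normal(0, 0.1, (a, b)) for a, b in zip(sizes, sizes[1:])]
        self.b = [np.zeros(b) for b in sizes[1:]]

    def forward(self, X):
        acts = [X]                                 # activations, layer by layer
        for W, b in zip(self.W, self.b):
            acts.append(sigmoid(acts[-1] @ W + b))
        return acts

    def loss(self, X, Y):                          # (1/2M) sum_i sum_l (y - o)^2
        O = self.forward(X)[-1]
        return 0.5 * float(np.mean(np.sum((Y - O) ** 2, axis=1)))

    def train_batch(self, X, Y, lr):
        acts = self.forward(X)
        M = len(X)
        # output delta: dJ/do = (o - y)/M, times sigmoid' = o(1 - o)
        delta = (acts[-1] - Y) / M * acts[-1] * (1 - acts[-1])
        for k in range(len(self.W) - 1, -1, -1):
            gW, gb = acts[k].T @ delta, delta.sum(axis=0)
            if k > 0:                              # back-propagate before updating W[k]
                delta = (delta @ self.W[k].T) * acts[k] * (1 - acts[k])
            self.W[k] -= lr * gW
            self.b[k] -= lr * gb

# Toy usage: XOR with r = 2 one-hot outputs; the loss should shrink.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.eye(2)[[0, 1, 1, 0]]
net = MLP(n=2, hidden=[8], r=2)
before = net.loss(X, Y)
for _ in range(2000):
    net.train_batch(X, Y, lr=2.0)
print(before, net.loss(X, Y))
```

For the real dataset the training loop would additionally shuffle the 60000 examples each epoch and feed them through `train_batch` in chunks of M = 100.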

(b) (5 points) Use the above implementation to experiment with a neural network having a single hidden layer. Vary the number of hidden layer units over the set {1, 10, 50, 100, 500}. Set the learning rate to 0.001. Use a mini-batch size of 100 examples. These will remain constant for the remaining experiments in the parts below. Choose a suitable stopping criterion and report it. Report and plot the accuracy on the training and the test sets, and the time taken to train the network. Plot each metric on the Y-axis against the number of hidden layer units on the X-axis. What do you observe? How do the above metrics change with the number of hidden layer units? NOTE: For accuracy computation, the inferred class label is simply the label having the highest probability as output by the network.

(c) (5 points) Use an adaptive learning rate inversely proportional to the square root of the number of epochs, i.e., η_e = η0/√e, where η0 = 0.5 is the seed value and e is the current epoch number². See if you need to change your stopping criterion, and report it. As before, plot the train/test set accuracies, as well as the training time, for each number of hidden layer units used in part (b) above with this new adaptive learning rate. How do your results compare with those obtained in the part above? Does the adaptive learning rate make training any faster? Comment on your observations.
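Under this reading of the schedule (our interpretation: e is 1-indexed, so the first epoch uses η0 itself), the rate update is a one-liner:

```python
import math

def eta(epoch, eta0=0.5):
    """Adaptive learning rate for epoch e (1-indexed): eta_e = eta0 / sqrt(e)."""
    return eta0 / math.sqrt(epoch)

print([eta(e) for e in (1, 4, 25)])   # 0.5, then 0.25, then 0.1
```

The schedule is recomputed once per epoch, not per mini-batch, and plugged into the same SGD update used in part (b).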

(d) (5 points) Several activation units other than sigmoid have been proposed in the literature, such as tanh and ReLU, to introduce non-linearity into the network. ReLU is defined by the function g(z) = max(0, z). In this part, we will replace the sigmoid activation units with ReLU for all the units in the hidden layers of the network (the activation for units in the output layer will still be sigmoid, to make sure the output is in the range (0, 1)). You can read about the relative advantages of using ReLU over several other activation units on this blog.

Change your code to work with the ReLU activation unit. Note that there is a small issue with ReLU: it is non-differentiable at z = 0. This can be resolved by making use of sub-gradients – intuitively, a sub-gradient allows us to use any value between the left and right derivatives at the point of non-differentiability to perform gradient descent (see this Wikipedia page for more details). Implement a network with 2 hidden layers of 100 units each. Experiment with both ReLU and sigmoid activation units as described above. Use the adaptive learning rate as described in part (c) above. Report your training and test set accuracies in each case. Also, make a relative comparison of the test set accuracies obtained using the two kinds of units. What do you observe? Which one performs better? Also, how do these results compare with the results in part (b) using a single hidden layer with sigmoid? Comment on your observations.
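With the sub-gradient convention of taking the value 0 at z = 0 (any value in [0, 1] would be valid), the ReLU forward and backward passes reduce to two elementwise operations:

```python
import numpy as np

def relu(z):
    """g(z) = max(0, z), applied elementwise."""
    return np.maximum(0.0, z)

def relu_subgrad(z):
    """Sub-gradient of ReLU: 1 for z > 0, 0 for z <= 0 (choosing 0 at z = 0)."""
    return (z > 0).astype(float)

z = np.array([-1.0, 0.0, 2.0])
print(relu(z), relu_subgrad(z))   # [0. 0. 2.] [0. 0. 1.]
```

In back-propagation, the hidden-layer factor o(1 − o) used for sigmoid is replaced by `relu_subgrad` evaluated at that layer's pre-activation z, so the implementation needs to cache pre-activations as well as activations.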

(e) (5 points) Use MLPClassifier from the scikit-learn library to implement a neural network with the same architecture as in part (d) above, and the same kinds of activation functions (ReLU for the hidden layers, sigmoid for the output). Use Stochastic Gradient Descent as the solver. Note that MLPClassifier only allows the Cross Entropy Loss (log-loss) over the final network output. Read about the binary cross-entropy loss. Compare the performance with the results of part (d). How does training using an existing library (and a modified loss function) compare with your results in part (d) above?

²One epoch corresponds to one complete pass through the data.
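A possible starting point with scikit-learn is sketched below on synthetic stand-in data. Two caveats: MLPClassifier fixes the output non-linearity itself (logistic/softmax paired with the log-loss), so only the hidden activation is configurable, and `learning_rate="invscaling"` (which decays as learning_rate_init / t^0.5 by default) is only roughly analogous to the η0/√e schedule of part (c).

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Stand-in data; use the Kannada digit train/test arrays instead.
X, y = make_classification(n_samples=400, n_features=20, n_classes=4,
                           n_informative=8, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(100, 100),   # two hidden layers, as in (d)
                    activation="relu",
                    solver="sgd",
                    learning_rate="invscaling",      # ~ eta0 / t^0.5 decay
                    max_iter=300,
                    random_state=0)
clf.fit(X, y)
print(clf.n_layers_, round(clf.score(X, y), 3))      # 4 layers: in, 100, 100, out
```

`clf.score` on the train and test arrays then gives the accuracies to compare against part (d).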

(f) Extra fun – No credits! The Kannada digit dataset was motivated by another famous digits dataset known as MNIST (available for download from the course web-page). You can read about it here. In this part, we want you to try out classification on MNIST using the same architecture that worked for recognizing Kannada digits and see how well it performs. How robust is the architecture to a new dataset? Comment on your observations and report your best model.

COL 774: Assignment 3