## Description

## 1.1. (Tree, 10 points)

Consider the following table which predicts whether a student is likely to

be overweight. Use crossentropy as criteria to construct the classification decision tree that predicts

whether or not a student is likely to be overweight. Show your step of the computation.

Stu ID Gender Hyperlipidemia Unhealthy Diet Exercises Overheight

1 boy yes no yes yes

2 boy yes yes no yes

3 girl no yes no yes

4 boy no no yes no

5 girl yes yes yes yes

6 boy no yes yes no

## 1.2. (Backpropagation for MLP, 10 points)

Consider the following classification MLP with

one hidden layer:

x = input ∈ R

D

z = W x + b1 ∈ R

K

h = ReLU(z) ∈ R

K

a = V h + b2 ∈ R

C

L = CrossEntropy(y, S(a)) ∈ R

where x ∈ R

D, b1 ∈ R

K, W ∈ R

K×D, b2 ∈ R

C, V ∈ R

C×K, where D is the size of the input, K

is the number of hidden units, and C is the number of classes. Draw the computational graph of

forward pass and derive the specific form ψi

, i = 1, . . . , 5 of gradients w.r.t. the parameters and

input.

∇V L = [ ∂L

∂V

]1,: = ψ1(u2, h) ∈ R

C×K

∇b2L = ( ∂L

∂b2

)

T = ψ2(u2) ∈ R

C

∇W L = [ ∂L

∂W

]1,: = ψ3(u1, x) ∈ R

K×D

∇b1L = ( ∂L

∂b1

)

T = ψ4(u1) ∈ R

K

∇xL = (∂L

∂x

)

T = ψ5(W,u1) ∈ R

D

where the gradients of the loss w.r.t. the two layers (logit and hidden) are given by the following:

u2 = ∇aL = (∂L

∂a

)

T = (p − y) ∈ R

C

u1 = ∇zL = (∂L

∂z

)

T = (V

Tu2) ⊙ H(z) ∈ R

K

2

where H(a) = I(a > 0) is the heaviside function. Note that, in our notation, the gradient (which

has the same shape as the variable with respect to which we differentiate) is equal to the Jacobian’s

transpose when the variable is a vector and to the first slice of the Jacobian when the variable is a

mtrix.

## 1.3. (Connection of neural network to logistic regression, 5 points)

Consider a neural

network for a K class outcome that uses cross entropy loss. If the network has no hidden layer,

show that the model is equivalent to the multinomial logistic model.

## 1.4. (CNN, 10 points)

Consider the convolutional network defined by the layers below. The

input shape is 32 × 32 × 3 and the output is 10 neurons. Consider layers:

Conv5(10) + Maxpool2 + Conv5(10) + Maxpool2 + FC10

where

• Conv5(10): 10 filters with each size 5 × 5 × D, where D is the depth of the activation volume

at the previous layer, stride = 1, padding = 2;

• Maxpool2

: 2 × 2 filter, stride = 2, padding = 0;

• FC10: A fully-connected layer with 10 output neurons.

(a) (5 pts) Compute the shape of activation map of each layer.

(b) (5 pts) Compute the total number of parameters of each layer.

## 1.5. (Dropout, 5 points)

To expect a good predictive model. We want it to peform well on

unseen data. Classical generalization theory suggests that we should close the gap between train

and test performance. Dropout is a standard regularization technique for training neural networks.

The method is called dropout because we literally drop out some neurons during training as in Fig.

1. Throughout training, on each iteration, standard dropout consists of zeroing out some fraction

of the nodes in each layer before calculating the subsequent layer. This prevents co-adaptations

where features are only useful when present in the presence of other features. It can dramatically

reduce overfitting and has been very widely used.

(a) (1 pts) Give some intuitive reason why dropout prevents complex co-adaptation of the hidden

units.

(b) (2 pts) In standard dropout regularization, one zeros out some fraction of the nodes in each

layer and then debiases each layer by normalizing by the fraction of nodes that were retained

3

Figure 1: MLP before and after dropout

(not dropped out). In other words, with dropout probability p, each intermediate activation h

is replaced by a random variable h

′ as follows:

h

′ =

0 with probability p

h

1−p

otherwise

Explain why we have a proportional coefficient 1

1−p

;

(c) (2 pts) Suppose we apply dropout to the activation h

(ℓ)

to get the output h

(ℓ+1) of the dropout

layer, i.e., h

(ℓ+1) = h

(ℓ) ⊙ mask, derive the gradient ∂L

∂h(ℓ) as a form of chain rule.

Hint: You can check the following materials:

• Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting,

https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf

• Stanford. CS231n: Dropout, https://cs231n.github.io/neural-networks-2/

## 1.6. (Assessing AUC and Performance of a Binary Classifier, 10 points)

Given a 2-d

dataset of 5 positive and 5 negative samples, and consider a simple linear model with a sigmoid

function f(x) = σ(w1x1 + w2x2) where w1 = 0.5 and w2 = −0.1, please complete the following

tasks:

Positive samples :

5 5

3 8

−1 8

5 4

−1 −2

Negative samples :

−5 1

2 −2

−1 1

5 −10

−10 −9

(a) (2 pts) Compute the predictions for all data points in the dataset using the provided model

f(x);

4

(b) (3 pts) Assume the threshold is 0.5, evaluate the performance of the model by calculating the

following evaluation scores:

• Accuracy

• Precision

• Recall

• F1 Score

• Confusion Matrix

(c) (3 pts) Construct a ROC curve for the model as follows

• The list of threshold is [0, 0.2, 0.4, 0.6, 0.8, 1.0];

• Hand-drawn is required;

• x-axis is FPR and y-axis is TPR;

• Use Fig. 2 as a reference.

Figure 2: Example of step-wise ROC curve

(d) (2 pts) Calculate the Area Under the ROC Curve, i.e., AUC.

##
2 Programming (50 points)

2.1 Tree (15 points)

The goal of this exercise is to practice tree-based methods on Diabetes data-set1 and predict if a

patient has diabetes. You are required to conduct some analysis and submit a complete report as

well as your executable code. Your report should contain :

• Preliminary data analysis. Some visualization of the data distribution/variability.

1https://www.kaggle.com/datasets/piyushborhade/diabetes-dataset

5

• Some data processing if you find it necessary. Data splitting with ratio 0.33.

• One implementation of uni-variate tree method for the classification problem.

• One implementation of an ensemble model for the classification problem. You can choose any

ensemble model that you have heard of.

• Result visualizations, comments on the results, your personal thinking, etc.

• Optional: can you think of a way to show how your tree splits the whole space?

You are free to use sklearn and other published packages. Your work is graded according to the

readability, the logic and the degree of completion presented in your report. The coherence of you

code is also taken into account.

Submission (-15 points if you do not submit code)

1. A PDF report HW3 Programming Tree.pdf. Your statements should be well supported

by figures or code explanation.

##
2. Your executable code script or notebook.

2.2 Neural Networks (35 points)

In this exercise, you will build your own fully-connected Neural Network (MLP, multi-layer perceptron) as well as your own Convolutional Neural Network. You will apply them on MNIST data-set

to perform hand-written digit classification.

You are provided with a notebook guide HW3 Programming NN guide.ipynb where PyTorch

framework is adopted. You have 2 options:

1. Follow the guide and fill in the blanks. Then, export the notebook to pdf file as your report.

You need to have either pickle or torchvision in your python environment to load the data-set.

2. If you are significantly more familiar with some other deep learning framework (e.g. TensorFlow). You can perform freely on this work. Please refer to part 2: simple NN and part

4: CNN of the guide for required procedures and outputs. You will be graded according to

the standard given in the guide, and your grade for 2.2 will be re-scaled with 35

25 .

## Note that:

• Do not modify the data-sets that the TA splitted for you in the guide (The train set is the

first 5500 samples of the original MNIST train set; the validation set is the last 500 samples

of the original MNIST train set; the test set is the MNIST test set with 10000 samples).

6

• If you obtain some bonus points, those points will ameliorate or saturate your grade of this

exercise (2.2) exclusively (not the whole programming part nor the whole homework).

• If you really struggle with the PyTorch Basics part, you can ask the TA for solution directly:

– subject: Demand for PyTorch Basics solution; content: your name + student ID

You can still do the bonus part to obtain at most 32 points on this exercise.

• If any of your classmates asks to copy your code, please make sure that they have already

asked the TA for the Basics solution. They are supposed to show you a screen shot of the

TA’s reply (“copy”) to their demand.

• No model score (3) nor performance score (4.5) for part 4: CNN if you directly load any

published CNN model to solve the problem.

Submission (-35 points if you do not submit code; no performance scores if you do not

include your model parameters)

1. A PDF report HW3 Programming NN.pdf.

2. Your executable code script or notebook, your best NN model, your best CNN model.

Discussion for computational cost:

If you have NVIDIA card on your PC, you can use cuda to accelerate the calculation. However,

theoretically, 2 layers of CNN are enough to have at least 97% test accuracy for this exercise.

If your PC has no worse than Intel Core i5-8250U Processor (as had the TA’s PC in 2019) and your

code is well done, the program shall be guaranteed to run fast.

Conventionally, you are encouraged to construct not too complicated CNN to save computational

cost, since MNIST is a relatively simple problem. However, as long as your code works out for you,

no point will be removed if your CNN is huge.

7