TDT 4265 Computer Vision and Deep Learning Assignment 1

Introduction
In this assignment we will explore the power of gradient descent to perform binary classification with Logistic Regression. Then, we will explore multi-class classification with Softmax Regression on the MNIST dataset. Further, you will experiment with weight regularization through the L2 norm, simple visualization of weights, and basic visualization to present your results.
With this assignment, we provide you with starter code for the programming tasks. You can download this from:
https://github.com/hukkelas/TDT4265-StarterCode.
Python files: For this assignment we require you to use the provided files and you are not allowed to create any additional files for your code (except task 4).
Report outline: We've included a jupyter notebook to help you create your report. You do not have to use this file. Remember to export the jupyter notebook to PDF before submitting it to blackboard.
To set up your environment, follow the guide in the Github repo:
https://github.com/hukkelas/TDT4265-StarterCode/blob/master/python_setup_instructions.md
Recommended readings
1. See "Recommended Resources" on blackboard.
2. 3Blue1Brown: What is a Neural Network?
3. 3Blue1Brown: Gradient Descent.
4. Neural Networks and Deep Learning: Chapter 1.
Delivery
We ask you to follow these guidelines:
• Report: Deliver your answers as a single PDF file. Include all tasks in the report, and mark it clearly with the task you are answering (Task 1.a, Task 1.b, Task 2.c etc.). There is no need to include your code in the report.
• Plots in report: For the plots in the report, ensure that they are large and easily readable. You might want to use the "ylim" function in the matplotlib package to "zoom" in on your plots. Label the different graphs such that it is easy for us to see which graphs correspond to the train, validation and test set.
• Source code: Upload your code as a zip file. In the assignment starter code, we have included
a script (create_submission_zip.py) to create your delivery zip. Please use this, as this will
structure the zipfile as we expect. (Run this from the same folder as all the python files).
To use the script, simply run: python3 create_submission_zip.py
• Upload to blackboard: Upload the ZIP file with your source code and the report to blackboard
before the delivery deadline.
Any group that does not follow these guidelines will have points subtracted.
Task 1: Theory
Notation: We use index $k$ to represent a node in the output layer, index $j$ to represent a node in the hidden layer, and index $i$ to represent an input unit $x_i$. Hence, the weight from node $i$ in the input layer to node $k$ in the output layer is $w_{ki}$. We write the activation of output unit $k$ as $\hat{y}_k = f(z_k)$, where $f$ represents the output unit activation function (sigmoid for logistic regression or softmax for softmax regression). In equations where multiple training examples are used (for example when summing over samples), we will use $n$ to specify which training example we are referring to. Hence, $y_k^n$ is the output of node $k$ for training example $n$. If we do not specify $n$ by writing $y_k$, we implicitly refer to $y_k^n$. Capital letters $N$, $I$, $J$ or $K$ refer to the total number of nodes in a layer. For logistic regression we will not specify which output node we are referring to, as there is only one output node in the network. Therefore, the weight $w_{k,i}$ can be written as $w_i$, as there is only a single output node $k$.
Logistic Regression
Logistic regression is a simple tool to perform binary classification. Logistic regression can be modeled as using a single neuron reading an input vector $x \in \mathbb{R}^{I+1}$ and parameterized by a weight vector $w \in \mathbb{R}^{I+1}$. $I$ is the number of input nodes, and we add a 1 at the beginning for a bias parameter (this is known as the "bias trick"). The neuron outputs the probability that $x$ is a member of class $C_1$. This can be written as,

$$P(x \in C_1 \mid x) = f(x) = \frac{1}{1 + e^{-w^T x}}, \qquad w^T x = \sum_{i}^{I} w_i \cdot x_i \tag{1}$$

$$P(x \in C_2 \mid x) = 1 - P(x \in C_1 \mid x) = 1 - f(x) \tag{2}$$

where $f(x)$ returns the probability of $x$ being a member of class $C_1$; $f \in [0, 1]$.[1] By defining the output of our network as $\hat{y}$, we have $\hat{y} = f(x)$.
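As a toy illustration of Equations 1 and 2 (the numbers and variable names below are made up for the example and are not part of the starter code), the forward pass can be evaluated directly with numpy:

```python
import numpy as np

# Toy evaluation of Equations 1 and 2 with made-up numbers.
# The leading 1 in x is the bias term from the "bias trick".
x = np.array([1.0, 0.5, -1.2])    # [1, x_1, x_2]
w = np.array([0.1, 0.8, 0.3])     # [w_0 (bias), w_1, w_2]

z = w @ x                         # w^T x = sum_i w_i * x_i
y_hat = 1.0 / (1.0 + np.exp(-z))  # f(x) = P(x in C_1 | x), Equation 1
print(y_hat, 1.0 - y_hat)         # P(C_1 | x) and P(C_2 | x), Equation 2
```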
We use the cross entropy loss function (Equation 3) for two categories to measure how well our function performs over our dataset. This loss function measures how well our hypothesis function $f$ does over the $N$ data points.

$$C(w) = \frac{1}{N} \sum_{n=1}^{N} C^n(w), \quad \text{where } C^n(w) = -\left( y^n \ln(\hat{y}^n) + (1 - y^n) \ln(1 - \hat{y}^n) \right) \tag{3}$$

Here, $y^n$ is the target value (also known as the label of the image). Note that we are computing the average cost function, such that the magnitude of our cost function is not dependent on the number of training examples. Our goal is to minimize this cost function through gradient descent, such that the cost function reaches a minimum of 0. This happens when $y^n = \hat{y}^n$ for all $n$.

[1] The function $f$ is known as the sigmoid activation function.
Task 1a: Derive the gradient for Logistic Regression (0.55 points)
To minimize the cost function with gradient descent, we require the gradient of the cost function. Show that for Equation 3, the gradient is:

$$\frac{\partial C^n(w)}{\partial w_i} = -(y^n - \hat{y}^n) x_i^n \tag{4}$$

Show thorough work such that your approach is clear.
Hint: To solve this, you have to use the chain rule. Also, you can use the fact that:

$$\frac{\partial f(x^n)}{\partial w_i} = x_i^n f(x^n)\left(1 - f(x^n)\right) \tag{5}$$
Softmax Regression
Softmax regression is simply a generalization of logistic regression to multi-class classification. Given an input $x$ which can belong to $K$ different classes, softmax regression will output a vector $\hat{y}$ (with length $K$), where each element $\hat{y}_k$ represents the probability that $x$ is a member of class $k$.

$$\hat{y}_k = \frac{e^{z_k}}{\sum_{k'}^{K} e^{z_{k'}}}, \quad \text{where } z_k = w_k^T \cdot x = \sum_{i}^{I} w_{k,i} \cdot x_i \tag{6}$$

Equation 6 is known as the Softmax function and it has the attribute that $\sum_{k}^{K} \hat{y}_k = 1$. Note that now $w$ is no longer a vector, but a weight matrix, $w \in \mathbb{R}^{K \times I}$.
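When implementing Equation 6, a common trick (not required by this assignment text, but worth knowing) is to subtract the maximum of $z$ before exponentiating; this cancels in the ratio, leaves the output unchanged, and only serves to avoid numerical overflow. A minimal sketch:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Equation 6 for a single example; z has shape [K].

    Subtracting max(z) is only for numerical stability; it cancels in the
    ratio and does not change the result.
    """
    z_shifted = z - np.max(z)
    exp_z = np.exp(z_shifted)
    return exp_z / exp_z.sum()

# The outputs sum to 1 (up to floating point), as stated below Equation 6.
print(softmax(np.array([1.0, 2.0, 3.0])).sum())
```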
The cross-entropy cost function for multiple classes is defined as,

$$C(w) = \frac{1}{N} \sum_{n=1}^{N} C^n(w), \qquad C^n(w) = -\sum_{k=1}^{K} y_k^n \ln(\hat{y}_k^n) \tag{7}$$
Task 1b: Derive the gradient for Softmax Regression (0.75 points)
For the multi-class cross entropy cost in Equation 7, show that the gradient is:

$$\frac{\partial C^n(w)}{\partial w_{kj}} = -x_j^n \left( y_k^n - \hat{y}_k^n \right) \tag{8}$$

A few hints if you get stuck:
• Derivation of the softmax is the hardest part. Break it down into two cases.
• $\sum_{k=1}^{K} y_k^n = 1$
• $\ln\left(\frac{a}{b}\right) = \ln a - \ln b$
Task 2: Logistic Regression through Gradient Descent
In this assignment you are going to start classifying digits in the well-known MNIST dataset. The MNIST dataset consists of 70,000 handwritten digits, split into 10 object classes (the numbers 0-9). The images are 28×28 grayscale images, and every image is perfectly labeled. The images are split into two datasets, a training set consisting of 60,000 images, and a testing set consisting of 10,000 images. For this assignment, we will use a subset of the MNIST dataset.[2]

Bias trick: Each image is 28×28, so the unraveled vector will be $x \in \mathbb{R}^{784}$. For each image, append a '1' to it, giving us $x \in \mathbb{R}^{785}$. With this trick, we don't need to implement the forward and backward pass for the bias.
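A minimal sketch of the bias trick with numpy; the array names are ours and the starter code may organize this differently:

```python
import numpy as np

# X: a batch of flattened images, shape [batch_size, 784].
X = np.zeros((32, 28 * 28))

# Append a column of ones so that one weight per output acts as the bias.
ones = np.ones((X.shape[0], 1))
X_with_bias = np.concatenate([X, ones], axis=1)
print(X_with_bias.shape)  # (32, 785)
```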
Logistic Regression through gradient descent
For this task, we will use mini-batch gradient descent to train a logistic regression model to predict if an image is either a 2 or a 3. We will remove all images in the MNIST dataset that are not a 2 or a 3 (this pre-processing is already implemented in the starter code). The target is 1 if the input is from the "2" category, and 0 otherwise.

Mini-batch gradient descent is a method that takes a batch of images to compute an average gradient, then uses this gradient to update the weights. Use the gradient derived for logistic regression to classify $x \in \mathbb{R}^{785}$ into the two categories, 2's and 3's.

Vectorizing code: We recommend that you vectorize the code with numpy, which will make the runtime of your code significantly faster. Note that vectorizing your code is not required, but highly recommended (it will be required for assignment 2). Vectorizing simply means that if you want to, for example, compute the gradient in Equation 4, you can compute it in one go instead of iterating over the number of examples and weights. For example, $w^T x$ can be written as w.dot(x); a small comparison is sketched below.
For this task, please:
(a) [1pt] Before implementing our gradient descent training loop, we will implement a couple of essential
functions. Implement four functions in the file task2a.py.
• Implement a function that pre-processes our images in the function pre_process_images. This should normalize our images from the range [0, 255] to [−1, 1],[3] and it should apply the bias trick.
• Implement a function that performs the forward pass through our single layer neural network.
Implement this in the function forward. This should implement the network outlined by
Equation 1.
• Implement a function that performs the backward pass through our single layer neural network. Implement this in the function backward. To find the gradient for our weight, we can
use the equation derived in task 1 (Equation 4).
• Implement cross entropy loss in the function cross_entropy_loss. This should compute the
average of the cross entropy loss over all targets/labels and predicted outputs. The cross
entropy loss is shown in Equation 3.
We have included a couple of simple tests to help you debug your code. This also includes a
gradient approximation test that you should get working. For those interested, this is explained in
more detail in the Appendix.
[2] We will only use the first 20,000 images in the training set to reduce computation time.
[3] Normalizing the input to be zero-centered improves convergence for neural networks. Read more about this in LeCun et al., Efficient Backprop, Section 4.3.
Note that you should not start on the subsequent tasks before all tests are passing!
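As a non-authoritative sketch of what the vectorized forward and cross_entropy_loss from part (a) might look like (the exact signatures and array shapes are defined by the starter code; pre_process_images and backward follow the same pattern as the bias-trick and vectorization sketches above):

```python
import numpy as np

def forward(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Sigmoid forward pass (Equation 1) for a whole batch.

    X: shape [batch_size, 785] (after the bias trick), w: shape [785, 1].
    Returns y_hat with shape [batch_size, 1].
    """
    return 1.0 / (1.0 + np.exp(-X @ w))

def cross_entropy_loss(targets: np.ndarray, outputs: np.ndarray) -> float:
    """Binary cross entropy (Equation 3), averaged over the batch."""
    loss = -(targets * np.log(outputs) + (1 - targets) * np.log(1 - outputs))
    return float(loss.mean())
```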
(b) [0.7pt] Implement logistic regression with mini-batch gradient descent for a single layer neural
network. The network should consist of a single weight matrix with 784 + 1 inputs and a single
output (the matrix will have shape 785×1). Initialize the weights (before any training) to all zeros.
We’ve set the default hyperparameters for you, so there is no need to change these.
During training, track the training loss for each gradient step (this is implemented in the starter
code). Less frequently, track the validation loss over the whole validation set (in the starter code,
this is tracked every time we progress 20% through the training set).
Implement this in the function train_step in task2.py.
(report) In your report, include a plot of the training and validation loss over training. Have the
number of gradient steps on the x-axis, and the loss on the y-axis. Use the ylim function to zoom
in on the graph (for us, ylim([0, 0.2]) worked fine).
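A hedged sketch of one mini-batch update as described in (b); the real train_step signature and hyperparameters come from the starter code, so the names here are only illustrative:

```python
import numpy as np

def train_step_sketch(X_batch, Y_batch, w, learning_rate):
    """One step of mini-batch gradient descent for logistic regression.

    X_batch: [batch_size, 785], Y_batch: [batch_size, 1], w: [785, 1].
    Returns the updated weights and the loss on this batch.
    """
    outputs = 1.0 / (1.0 + np.exp(-X_batch @ w))                # forward (Eq. 1)
    grad = -X_batch.T @ (Y_batch - outputs) / X_batch.shape[0]  # backward (Eq. 4)
    w = w - learning_rate * grad                                # gradient descent update
    loss = -np.mean(Y_batch * np.log(outputs)
                    + (1 - Y_batch) * np.log(1 - outputs))      # Eq. 3
    return w, float(loss)
```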
(c) [0.3pt] Implement a function that computes the binary classification accuracy[4] over a dataset.
Implement this in the function calculate_accuracy.
(report) Compute the accuracy on the training set and validation set over training. Plot this in
a graph (similar to the loss), and include the plot in your report. Use the ylim function to zoom
in on the graph (for us, ylim([0.93, 0.99]) worked fine).
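A minimal sketch of binary accuracy using the 0.5 threshold from footnote [4]; the argument names are ours, not from the starter code:

```python
import numpy as np

def calculate_accuracy_sketch(targets: np.ndarray, outputs: np.ndarray) -> float:
    """Fraction of correct predictions; predict 1 if y_hat >= 0.5, else 0."""
    predictions = (outputs >= 0.5).astype(int)
    return float((predictions == targets).mean())
```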
Early Stopping: Early stopping is a tool to stop the training before your model overfits on the training dataset. By using a validation set along with our training set,[5] we can regularly check if our model is starting to overfit or not. If we notice that the cost function on the validation set stops improving, we can stop the training and return the weights at the minimum validation loss.

Dataset shuffling: Shuffling the training dataset between each epoch improves convergence. By shuffling, the batches you present to the network are composed differently in every epoch, which produces larger errors and improves gradient descent.[6]
(d) [0.3pt] Implement early stopping into your training loop. Use the following early stopping criterion: stop the training if the validation loss does not improve after passing through 20% of the training dataset 10 times. Increase the number of epochs to 500. (report) After how many epochs does early stopping kick in?
You can implement early stopping in the file trainer.py.
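A sketch of the counting logic described in (d), assuming the validation loss is evaluated at every 20%-of-the-training-set checkpoint; the class name and structure are illustrative, not taken from trainer.py:

```python
class EarlyStopperSketch:
    """Stop when the validation loss has not improved for 10 consecutive
    validation checks (each check happens after 20% of the training set)."""

    def __init__(self, patience: int = 10):
        self.patience = patience
        self.best_val_loss = float("inf")
        self.checks_without_improvement = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best_val_loss:
            self.best_val_loss = val_loss
            self.checks_without_improvement = 0
            return False
        self.checks_without_improvement += 1
        return self.checks_without_improvement >= self.patience
```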
(e) [0.4pt] Implement dataset shuffling for your training. Before each epoch, you should shuffle all the samples in the dataset. Implement this in the function batch_loader in utils.py.
(report) Include a plot in your report of the validation accuracy with and without shuffling. You should notice that the validation accuracy has fewer "spikes". Why does this happen?
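A sketch of per-epoch shuffling in a batch loader, assuming X and Y are numpy arrays; the actual batch_loader in utils.py may be structured differently:

```python
import numpy as np

def batch_loader_sketch(X: np.ndarray, Y: np.ndarray, batch_size: int,
                        shuffle: bool = True):
    """Yield mini-batches; reshuffle the sample order once per epoch."""
    indices = np.arange(X.shape[0])
    if shuffle:
        np.random.shuffle(indices)  # new sample order for this epoch
    for start in range(0, X.shape[0], batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], Y[batch_idx]
```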
[4] accuracy = (number of correct predictions) / (total number of predictions). The prediction is determined as 1 if $\hat{y} \geq 0.5$, else 0.
[5] Note that we never train on the validation set.
[6] For those interested, you can read more about dataset shuffling in Section 4.2 of Efficient Backprop.
Task 3: Softmax Regression through Gradient Descent
In this task, we will perform a 10-way classification on the MNIST dataset with softmax regression. Use the gradient derived for the softmax regression loss and mini-batch gradient descent to optimize your model.

One-hot encoding: With multi-class classification tasks, it is required to one-hot encode the target values. Convert the target values from integers to one-hot vectors (e.g. 3 → [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]). The length of the vector should be equal to the number of classes (K = 10 for MNIST, 1 class per digit).
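A minimal numpy sketch of this conversion; the exact signature of one_hot_encode is defined by the starter code:

```python
import numpy as np

def one_hot_encode_sketch(Y: np.ndarray, num_classes: int = 10) -> np.ndarray:
    """Convert integer labels of shape [N, 1] into one-hot vectors [N, num_classes]."""
    Y = Y.reshape(-1)
    encoded = np.zeros((Y.shape[0], num_classes))
    encoded[np.arange(Y.shape[0]), Y] = 1.0
    return encoded

# Example: label 3 becomes [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(one_hot_encode_sketch(np.array([[3]])))
```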
For this task, please:
(a) [1.1pt] Before implementing our gradient descent training loop, we will implement a couple of
essential functions. Implement four functions in the file task3a.py.
• Implement a function that one-hot encodes our labels in the function one_hot_encode. This
should return a new vector with one-hot encoded labels.
• Implement a function that performs the forward pass through our single layer softmax model.
Implement this in the function forward. This should implement the network outlined by
Equation 6.
• Implement a function that performs the backward pass through our single layer neural network. Implement this in the function backward. To find the gradient for our weight, use
Equation 8.
• Implement cross entropy loss in the function cross_entropy_loss. This should compute the
average of the cross entropy loss over all targets/labels and predicted outputs. The cross
entropy loss is defined in Equation 7.
We have included a couple of simple tests to help you debug your code.
(b) [0.2pt] The rest of the task 3 subtasks should be implemented in task3.py.
Implement softmax regression with mini-batch gradient descent for a single layer neural network.
The network should consist of a single weight matrix, with 784 + 1 inputs and ten outputs (shape
785 × 10). Initialize the weights (before any training) to all zeros.
Implement this in train_step in task3.py. This function should be identical to the one in task 2b, except that you are using a different cross entropy loss function.
(report) In your report, include a plot of the training and validation loss over training. Have the
number of gradient steps on the x-axis, and the loss on the y-axis. Use the ylim function to zoom
in on the graph (for us, ylim([0.2, .6]) worked fine).
(c) [0.3pt] Implement a function that computes the multi-class classification accuracy over a dataset.
Implement this in the function calculate_accuracy.
(report) Include in your report a plot of the training and validation accuracy over training.
(d) [0.3pt] (report) For your model trained in task 3c, do you notice any signs of overfitting? Explain
your reasoning.
Task 4: Regularization
One way to improve generalization is to use regularization. Regularization is a modification we make to a learning algorithm that is intended to reduce its generalization error.[7] Regularization is the idea that we should penalize the model for being too complex. In this assignment, we will carry this out by introducing a new term in our objective function to make the model "smaller" by minimizing the weights.

$$J(w) = C(w) + \lambda R(w), \tag{9}$$

where $R(w)$ is the complexity penalty and $\lambda$ is the strength of the regularization (a constant). There are several forms for $R$, such as L2 regularization,

$$R(w) = \|w\|^2 = \sum_{i,j} w_{i,j}^2, \tag{10}$$

where $w$ is the weight vector of our model.
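As an illustration of Equations 9 and 10 only (the gradient of the regularization term is exactly what task 4a asks you to derive, so it is deliberately left out here), the regularized objective can be computed as:

```python
import numpy as np

def l2_penalty(w: np.ndarray) -> float:
    """R(w): sum of squared weights (Equation 10)."""
    return float(np.sum(w ** 2))

def regularized_cost(cross_entropy_cost: float, w: np.ndarray, lam: float) -> float:
    """J(w) = C(w) + lambda * R(w) (Equation 9)."""
    return cross_entropy_cost + lam * l2_penalty(w)
```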
For your report, please:
(a) [0.3pt] (report) Derive the update term for softmax regression with L2 regularization, that is, find $\frac{\partial J}{\partial w}$, where $C$ is given by Equation 7.
(b) [0.6pt] Implement L2 regularization in your backward pass. You can implement the regularization
in backward in task3a.py.
For the remaining part of the assignment, you can implement the functionality in the
file task3.py or create a new file.
(report) Train two different models with different λ values for the L2 regularization. Use λ = 0.0 and λ = 1.0. Visualize the weights for each digit for the two models. Why are the weights for the model with λ = 1.0 less noisy?
The visualization should be similar to Figure 1.
Figure 1: The visualization of the weights for a model with λ = 0.0 (top row), and λ = 1.0 (bottom
row).
(c) [0.4pt] (report) Train your model with different values of λ: 1.0, 0.1, 0.01, 0.001. Note that for
each value of λ, you should train a new network from scratch. Plot the validation accuracy for
different values of λ on the same graph (during training). Have the accuracy on the y-axis, and
number of training steps on the x-axis.
(d) [0.4pt] (report) You will notice that the validation accuracy degrades when applying any amount
of regularization. What do you think is the reason for this?
(e) [0.4pt] (report) Plot the length (L2 norm, $\|w\|^2$) of the weight vector for each λ value in task 4c. What do you observe? Have the λ value on the x-axis and the L2 norm on the y-axis.
Note that you should plot the L2 norm of the weights after each network has finished training.
[7] The generalization error can be thought of as training error – validation error.
Appendix
Gradient Approximation test
When implementing neural networks from the bottom up, several minor bugs can occur that completely destroy the training process. Gradient approximation is a method to get a numerical approximation of what the gradient should be, and this is extremely useful when debugging your forward, backward, and cost functions. If the test fails, it indicates that there is a bug in one (or more) of these functions.
It is possible to compute the gradient with respect to one weight by using numerical approximation:

$$\frac{\partial C^n}{\partial w_{ji}} = \frac{C^n(w_{ji} + \epsilon) - C^n(w_{ji} - \epsilon)}{2\epsilon}, \tag{11}$$

where $\epsilon$ is a small constant (e.g. $10^{-2}$), and $C^n(w_{ji} + \epsilon)$ refers to the error on example $x^n$ when weight $w_{ji}$ is set to $w_{ji} + \epsilon$. The difference between the gradients should be within big-O of $\epsilon^2$, so if $\epsilon = 10^{-2}$, your gradients should agree within $O(10^{-4})$.
If your gradient approximation does not agree with your calculated gradient from backpropagation, there
is something wrong with your code!
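A hedged sketch of this check, assuming you have a function that computes the cost $C^n(w)$ for a fixed example and the analytical gradient from your backward pass; the names are illustrative, and the starter code ships its own version of this test:

```python
import numpy as np

def numerical_gradient_check(compute_cost, w: np.ndarray,
                             analytic_grad: np.ndarray,
                             epsilon: float = 1e-2) -> None:
    """Compare the backpropagated gradient against Equation 11, weight by weight.

    compute_cost: function mapping w -> C^n(w) for a fixed example x^n.
    """
    for idx in np.ndindex(w.shape):
        orig = w[idx]
        w[idx] = orig + epsilon
        cost_plus = compute_cost(w)
        w[idx] = orig - epsilon
        cost_minus = compute_cost(w)
        w[idx] = orig  # restore the weight before moving on
        approx = (cost_plus - cost_minus) / (2 * epsilon)
        difference = abs(approx - analytic_grad[idx])
        # The two gradients should agree to roughly O(epsilon^2).
        assert difference <= epsilon ** 2, (
            f"Gradient mismatch at {idx}: numeric {approx}, "
            f"analytic {analytic_grad[idx]}")
```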