## Description

Gradient Descent

1. (3 points) We often use iterative optimization algorithms such as Gradient Descent to find w

that minimizes a loss function f(w). Recall that in gradient descent, we start with an initial

1

value of w (say w(1)) and iteratively take a step in the direction of the negative of the gradient

of the objective function i.e.

w(t+1) = w(t) − η∇f(w(t)

) (1)

for learning rate η > 0.

In this question, we will develop a slightly deeper understanding of this update rule, in particular for minimizing a convex function f(w). Note: this analysis will not directly carry over

to training neural networks since loss functions for training neural networks are typically not

convex, but this will (a) develop intuition and (b) provide a starting point for research in

non-convex optimization (which is beyond the scope of this class).

Recall the first-order Taylor approximation of f at w(t)

:

f(w) ≈ f(w(t)

) + hw − w(t)

, ∇f(w(t)

)i (2)

When f is convex, this approximation forms a lower bound of f, i.e.

f(w) ≥ f(w(t)

) + hw − w(t)

, ∇f(w(t)

)i

| {z }

affine lower bound to f(·)

∀w (3)

Since this approximation is a ‘simpler’ function than f(·), we could consider minimizing the

approximation instead of f(·). Two immediate problems: (1) the approximation is affine (thus

unbounded from below) and (2) the approximation is faithful for w close to w(t)

. To solve

both problems, we add a squared `2 proximity term to the approximation minimization:

argmin

w

f(w(t)

) + hw − w(t)

, ∇f(w(t)

)i

| {z }

affine lower bound to f(·)

+

λ

2

|{z}

trade-off

w − w(t)

2 CS 7643

| {z }

proximity term

(4)

Notice that the optimization problem above is an unconstrained quadratic programming problem, meaning that it can be solved in closed form (hint: gradients).

What is the solution w∗ of the above optimization? What does that tell you about the gradient

descent update rule? What is the relationship between λ and η?

2. (3 points) Let’s prove a lemma that will initially seem devoid of the rest of the analysis but

will come in handy in the next sub-question when we start combining things. Specifically, the

analysis in this sub-question holds for any w?

, but in the next sub-question we will use it for

w?

that minimizes f(w).

Consider a sequence of vectors v1, v2, …, vT , and an update equation of the form w(t+1) =

w(t) − ηvt with w(1) = 0. Show that:

X

T

t=1

hw(t) − w?

, vti ≤ ||w?

||2

2η

+

η

2

X

T

t=1

||vt

||2

(5)

3. (3 points) Now let’s start putting things together and analyze the convergence rate of gradient

descent i.e. how fast it converges to w?

.

First, show that for ¯w =

1

T

PT

t=1 w(t)

2

f(¯w) − f(w?

) ≤

1

T

X

T

t=1

hw(t) − w?

, ∇f(w(t)

)i (6)

Next, use the result from part 2, with upper bounds B and ρ for ||w?

|| and

∇f(w(t)

)

respectively and show that for fixed η =

q B2

ρ

2T

, the convergence rate of gradient descent is

O(1/

√

T) i.e. the upper bound for f(¯w) − f(w?

) ∝ √

1

T

.

4. (2 points) Consider an objective function f(w) := f1(w) + f2(w) comprised of N = 2 terms:

f1(w) = − ln

1 −

1

1 + exp(−w)

and f2(w) = − ln

1

1 + exp(−w)

(7)

Now consider using SGD (with a batch-size B = 1) to minimize f(w). Specifically, in each

iteration, we will pick one of the two terms (uniformly at random), and take a step in the

direction of the negative gradient, with a constant step-size of η. You can assume η is small

enough that every update does result in improvement (aka descent) on the sampled term.

Is SGD guaranteed to decrease the overall loss function in every iteration? If yes, provide a

proof. If no, provide a counter-example.

2 Automatic Differentiation

5. (4 points) In practice, writing the closed-form expression of the derivative of a loss function f

w.r.t. the parameters of a deep neural network is hard (and mostly unnecessary) as f becomes

complex. Instead, we define computation graphs and use the automatic differentiation algorithms (typically backpropagation) to compute gradients using the chain rule. For example,

consider the expression

f(x, y) = (x + y)(y + 1) (8)

Let’s define intermediate variables a and b such that

a = x + y (9)

b = y + 1 (10)

f = a × b (11)

A computation graph for the “forward pass” through f is shown in Fig. 1.

Figure 1

We can then work backwards and compute the derivative of f w.r.t. each intermediate variable ( ∂f

∂a and ∂f

∂b ) and chain them together to get ∂f

∂x and ∂f

∂y .

3 CS 7643

Let σ(·) denote the standard sigmoid function. Now, for the following vector function:

f1(w1, w2) = e

e

w1+e

2w2 + σ(e

w1 + e

2w2

) (12)

f2(w1, w2) = w1w2 + max(w1, w2) (13)

(a) Draw the computation graph. Compute the value of f at ~w = (1, −1).

(b) At this ~w, compute the Jacobian ∂f~

∂ ~w using numerical differentiation (using ∆w = 0.01).

(c) At this ~w, compute the Jacobian using forward mode auto-differentiation.

(d) At this ~w, compute the Jacobian using backward mode auto-differentiation.

(e) Don’t you love that software exists to do this for us?

3 Paper Review

The first of our paper reviews for this course comes from a much acclaimed spotlight presentation

at NeurIPS 2019 on the topic ‘Weight Agnostic Neural Networks’ by Adam Gaier and David Ha

from Google Brain.

The paper presents a very interesting proposition that, through a series of experiments, re-examines

some fundamental notions about neural networks – in particular, the comparative importance of

architectures and weights in a network’s predictive performance.

The paper can be viewed here. The authors have also written a blog post with intuitive visualizations

to help understand its key concepts better.

Guidelines: Please restrict your reviews to no more than 350 words. The evaluation rubric for this

section is as follows :

6. (2 points) What is the main contribution of this paper? Briefly summarize its key insights,

strengths and weaknesses.

7. (2 points) What is your personal takeaway from this paper? This could be expressed either

in terms of relating the approaches adopted in this paper to your traditional understanding

of learning parameterized models, or potential future directions of research in the area which

the authors haven’t addressed, or anything else that struck you as being noteworthy.

4 Implement and train a network on CIFAR-10

Setup Instructions: Before attempting this question, look at setup instructions at here.

8. (Upto 29 points) Now, we will learn how to implement a softmax classifier, vanilla neural

networks (or Multi-Layer Perceptrons), and ConvNets. You will begin by writing the forward

and backward passes for different types of layers (including convolution and pooling), and

then go on to train a shallow ConvNet on the CIFAR-10 dataset in Python. Next you will

learn to use PyTorch, a popular open-source deep learning framework, and use it to replicate

the experiments from before.

Follow the instructions provided here

4 CS 7643: Problem Set 1