EL-GY-9123 Homework 5

Gradient Calculations and Nonlinear Optimization

1. Suppose we want to fit a model,
\hat{y} = \frac{1}{w_0 + \sum_{j=1}^d w_j x_j},

for parameters w. Given training data (x_i, y_i), i = 1, ..., n, a nonlinear least squares fit could use the loss function,

J(w) = \sum_{i=1}^n \left[ y_i - \frac{1}{w_0 + \sum_{j=1}^d w_j x_{ij}} \right]^2.

(a) Find a function g(z) and matrix A such that the loss function is given by,

J(w) = g(z), \quad z = Aw,

and g(z) is factorizable, meaning g(z) = \sum_i g_i(z_i) for some functions g_i(z_i).

(b) What is the gradient ∇J(w)?

(c) What is the gradient descent update for w?

(d) Write a few lines of Python code to compute the loss function J(w) and ∇J(w).
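
For instance, a minimal NumPy sketch; the function name loss_and_grad and the array shapes below are illustrative assumptions, not specified by the problem:

import numpy as np

def loss_and_grad(w, X, y):
    # Loss J(w) and gradient for the model yhat = 1 / (w0 + sum_j wj * xij).
    # Assumed shapes: X is (n, d), y is (n,), w is (d+1,) with w[0] = w0.
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])   # z = A w with A = [1, X]
    z = A @ w                              # z_i = w0 + sum_j wj * xij
    r = y - 1.0 / z                        # residuals r_i = y_i - 1/z_i
    J = np.sum(r ** 2)                     # J(w) = sum_i r_i^2
    dg = 2.0 * r / z ** 2                  # dJ/dz_i = 2 r_i / z_i^2
    grad = A.T @ dg                        # chain rule: grad J(w) = A^T dJ/dz
    return J, grad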

2. In this problem, we will see why gradient descent can often exhibit very slow convergence,
even on apparently simple functions.

Consider the objective function,

J(w) = \frac{1}{2} b_1 w_1^2 + \frac{1}{2} b_2 w_2^2,

defined on a vector w = (w_1, w_2) with constants b_2 > b_1 > 0.

(a) What is the gradient ∇J(w)?

(b) What is the minimizer w^* = \arg\min_w J(w)?

(c) Part (b) shows that we can minimize J(w) easily by hand. But suppose we tried to minimize it via gradient descent. Show that the gradient descent update of w with a step size α has the form,

w_1^{k+1} = \rho_1 w_1^k, \quad w_2^{k+1} = \rho_2 w_2^k,

for some constants ρ_i, i = 1, 2. Write ρ_i in terms of b_i and the step size α.

(d) For what values of α will gradient descent converge to the minimum? That is, what step sizes guarantee that w^k → w^*?

(e) Take α = 2/(b_1 + b_2). It can be shown that this choice of α results in the fastest convergence; you do not need to show this. But show that, with this selection of α,

\|w^k\| = C^k \|w^0\|, \quad C = \frac{\kappa - 1}{\kappa + 1}, \quad \kappa = \frac{b_2}{b_1}.

The term κ is called the condition number. The above calculation shows that when κ is very large, C ≈ 1 and the convergence of gradient descent is slow. In general, gradient descent performs poorly on ill-conditioned problems like this.
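
As an illustration, a short NumPy sketch of gradient descent on this objective; the values of b_1, b_2, the starting point, and the iteration count below are arbitrary choices, not part of the problem:

import numpy as np

b = np.array([1.0, 100.0])        # b1 = 1, b2 = 100, so kappa = b2/b1 = 100
alpha = 2.0 / (b[0] + b[1])       # the step size from part (e)
w = np.array([1.0, 1.0])          # arbitrary starting point w^0

for k in range(10):
    w = w - alpha * b * w         # gradient step with grad J(w) = (b1*w1, b2*w2)
    print(k + 1, np.linalg.norm(w))

# Each iteration shrinks ||w^k|| by C = (kappa - 1)/(kappa + 1) ~ 0.98,
# so progress toward the minimum is slow when kappa is large.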

3. Matrix minimization. Consider the problem of finding a matrix P ∈ R^{m×m} to minimize the loss function,

J(P) = \sum_{i=1}^n \left[ \frac{z_i}{y_i} - \ln(z_i) \right], \quad z_i = x_i^T P x_i.

The problem arises in wireless communications, where an m-antenna receiver wishes to estimate a spatial covariance matrix P from n power measurements. In this setting, y_i > 0 is the i-th receive power measurement and x_i is the beamforming direction for that measurement.

In reality, the quantities would be complex, but for simplicity we will just look at the
real-valued case. See the following article for more details:

Eliasi, Parisa A., Sundeep Rangan, and Theodore S. Rappaport. “Low-rank spatial
channel estimation for millimeter wave cellular systems,” IEEE Transactions on
Wireless Communications 16.5 (2017): 2748-2759.

(a) What is the gradient ∇_P z_i?

(b) What is the gradient ∇_P J(P)?

(c) Write a few lines of Python code to evaluate J(P) and ∇_P J(P) given data x_i and y_i. You can use a for loop.
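
For example, a sketch with an explicit loop; the function name and the array shapes assumed below are illustrative:

import numpy as np

def loss_and_grad_P(P, X, y):
    # J(P) and grad_P J(P) computed with a for loop.
    # Assumed shapes: X is (n, m) with rows x_i, y is (n,), P is (m, m).
    n, m = X.shape
    J = 0.0
    grad = np.zeros((m, m))
    for i in range(n):
        xi = X[i]
        zi = xi @ P @ xi                   # z_i = x_i^T P x_i
        J += zi / y[i] - np.log(zi)
        # grad_P z_i = x_i x_i^T, so the i-th term contributes (1/y_i - 1/z_i) x_i x_i^T
        grad += (1.0 / y[i] - 1.0 / zi) * np.outer(xi, xi)
    return J, grad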

(d) See if you can rewrite (c) without a for loop. You will need Python broadcasting.
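
One possible loop-free version using broadcasting, under the same assumed shapes as the sketch above:

import numpy as np

def loss_and_grad_P_vec(P, X, y):
    # Vectorized J(P) and grad_P J(P): no explicit loop over samples.
    z = np.sum((X @ P) * X, axis=1)     # z_i = x_i^T P x_i for all i at once
    J = np.sum(z / y - np.log(z))
    c = 1.0 / y - 1.0 / z               # per-sample coefficients (1/y_i - 1/z_i)
    grad = (X * c[:, None]).T @ X       # sum_i c_i x_i x_i^T via broadcasting
    return J, grad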

4. Nested optimization. Suppose we are given a loss function J(w1, w2) with two parameter
vectors w1 and w2. In some cases, it is easy to minimize over one of the sets of parameters,
say w2, while holding the other parameter vector (say, w1) constant.

In this case, one could perform the following nested minimization:

Define

J_1(w_1) := \min_{w_2} J(w_1, w_2), \quad \hat{w}_2(w_1) := \arg\min_{w_2} J(w_1, w_2),

which represent the minimum and the minimizing argument of the loss function over w_2, holding w_1 constant.

Then,

\hat{w}_1 = \arg\min_{w_1} J_1(w_1) = \arg\min_{w_1} \min_{w_2} J(w_1, w_2).

Hence, we can find the optimal w_1 by minimizing J_1(w_1) instead of minimizing J(w_1, w_2) over w_1 and w_2.

(a) Show that the gradient of J_1(w_1) is given by

\nabla_{w_1} J_1(w_1) = \nabla_{w_1} J(w_1, w_2) \big|_{w_2 = \hat{w}_2}.

Thus, given w_1, we can evaluate the gradient by (i) solving the minimization ŵ_2 := arg min_{w_2} J(w_1, w_2), and (ii) taking the gradient ∇_{w_1} J(w_1, w_2) and evaluating it at w_2 = ŵ_2.
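
A sketch of one standard way to see this, assuming the minimizer ŵ_2(w_1) is differentiable in w_1 and satisfies the first-order condition ∇_{w_2} J(w_1, ŵ_2) = 0: by the chain rule,

\nabla_{w_1} J_1(w_1) = \nabla_{w_1} J(w_1, \hat{w}_2(w_1)) + \left[ \frac{\partial \hat{w}_2(w_1)}{\partial w_1} \right]^T \nabla_{w_2} J(w_1, w_2) \Big|_{w_2 = \hat{w}_2},

and the second term vanishes because ŵ_2 minimizes J over w_2.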

(b) Suppose we want to minimize a nonlinear least squares loss,

J(a, b) := \sum_{i=1}^n \left[ y_i - \sum_{j=1}^d b_j e^{-a_j x_i} \right]^2,

over two parameters a and b. Given parameters a, describe how we can minimize over b. That is, how can we compute

\hat{b} := \arg\min_b J(a, b).
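
For concreteness, a NumPy sketch of one way to do this, exploiting the fact that the model is linear in b once a is fixed; the function name bhat and the least-squares solver choice are illustrative assumptions:

import numpy as np

def bhat(a, x, y):
    # Given a, the model sum_j bj * exp(-aj * xi) is linear in b,
    # so bhat solves an ordinary linear least squares problem.
    E = np.exp(-np.outer(x, a))                # E[i, j] = exp(-a_j * x_i), shape (n, d)
    b, *_ = np.linalg.lstsq(E, y, rcond=None)  # minimize ||y - E b||^2 over b
    return b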

(c) In the above example, how would we compute the gradient ∇_a J(a, b)?
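
A possible NumPy sketch of this gradient; the function name grad_a is illustrative and the shapes match the sketch above:

import numpy as np

def grad_a(a, b, x, y):
    # Gradient of J(a, b) with respect to a.
    E = np.exp(-np.outer(x, a))            # E[i, j] = exp(-a_j * x_i)
    r = y - E @ b                          # residuals r_i = y_i - sum_j b_j exp(-a_j x_i)
    # dJ/da_j = sum_i 2 r_i * b_j * x_i * exp(-a_j x_i)
    return 2.0 * b * ((E * x[:, None]).T @ r)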