## Description

1. Alternative regularization formulas. This problem is about two alternative ways of solving the L2-regularized least squares problem.

a) Prove that for any λ > 0, the following matrix identity holds:

$$(A^T A + \lambda I)^{-1} A^T = A^T (A A^T + \lambda I)^{-1}$$

Hint: Start by considering the expression $A^T A A^T + \lambda A^T$ and factor it in two different ways (from the right or from the left).
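Though no substitute for the proof, the identity is easy to check numerically on a small random matrix (a sketch; the dimensions and λ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))  # small random matrix for the check
lam = 0.7                        # any lambda > 0

I_n = np.eye(A.shape[1])  # n x n identity
I_m = np.eye(A.shape[0])  # m x m identity

lhs = np.linalg.solve(A.T @ A + lam * I_n, A.T)  # (A^T A + λI)^{-1} A^T
rhs = A.T @ np.linalg.inv(A @ A.T + lam * I_m)   # A^T (A A^T + λI)^{-1}

# The two sides agree up to floating-point error
assert np.allclose(lhs, rhs)
```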

b) The identity proved in part a) shows that there are actually two equivalent formulas for the solution to the L2-regularized least squares problem. Suppose $A \in \mathbb{R}^{8000 \times 100}$ and $y \in \mathbb{R}^{8000}$, and use this identity to find $w$ that minimizes

$$\|Aw - y\|_2^2 + \lambda \|w\|_2^2$$

in two different ways. Which formula will compute more rapidly? Why? Note: The number of operations required for matrix inversion is proportional to the cube of the matrix dimension.
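To see the trade-off concretely, both formulas can be evaluated and compared on a smaller stand-in matrix (a hypothetical 1000-by-20 $A$ so the example runs quickly): with a tall matrix, one formula solves a small $n \times n$ system while the other solves a large $m \times m$ one.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 1000, 20            # tall-and-skinny, like the 8000x100 case (scaled down)
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
lam = 0.5

# Formula 1: solve with the small n x n matrix A^T A + λI
w1 = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

# Formula 2: solve with the large m x m matrix A A^T + λI
w2 = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(m), y)

# Same minimizer either way, by the identity from part a)
assert np.allclose(w1, w2)
```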

c) A breast cancer gene database has approximately 8000 genes from 100 subjects. The label $y_i$ is the disease state of the ith subject (+1 if no cancer, −1 if breast cancer). Suppose we build a linear classifier that combines the 8000 genes, say $g_i$, $i = 1, 2, \ldots, 100$, to predict whether a subject has cancer: $\hat{y}_i = \mathrm{sign}\{g_i^T w\}$. Note that here $g_i$ and $w$ are 8000-by-1 vectors.

i. Write down the least squares problem for finding classifier weights $w$ given the 100 labels. Does this problem have a unique solution?

ii. Write down a Tikhonov (ridge) regression problem for finding the classifier weights given the 100 labels. Does this problem have a unique solution? Which form of the identity in part a) leads to the most computationally efficient solution for the classifier weights?
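For intuition, a ridge classifier of the kind described in part ii. might be fit as sketched below. The matrix `G`, whose rows are the vectors $g_i^T$, and the labels are randomly generated stand-ins here, not data from the actual database:

```python
import numpy as np

rng = np.random.default_rng(2)
G = rng.standard_normal((100, 8000))   # 100 subjects x 8000 genes (synthetic)
y = rng.choice([-1.0, 1.0], size=100)  # synthetic disease labels
lam = 1.0

# Efficient form of the ridge solution: solve a 100x100 system
# (G G^T + λI) instead of an 8000x8000 one (G^T G + λI)
w = G.T @ np.linalg.solve(G @ G.T + lam * np.eye(100), y)

y_hat = np.sign(G @ w)  # predicted labels on the training subjects
```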

2. The key idea behind proximal gradient descent is to reformulate the general regularized least-squares problem into a set of simpler scalar optimization problems. Consider the regularized least-squares problem

$$\min_w \|z - w\|_2^2 + \lambda r(w)$$

An upper bound and completing the square were used to simplify the generalized least-squares problem into this form. Let the ith elements of $z$ and $w$ be $z_i$ and $w_i$, respectively.


a) Assume $r(w) = \|w\|_2^2$. Write the regularized least-squares problem as a series of separable problems involving only $w_i$ and $z_i$.

b) Assume $r(w) = \|w\|_1$. Write the regularized least-squares problem as a series of separable problems involving only $w_i$ and $z_i$.

3. A script is available to compute a specified number of iterations of the proximal gradient descent algorithm for solving a Tikhonov-regularized least squares problem

$$\min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2$$

The provided script will get you started displaying the path taken by the weights in the proximal gradient descent iteration superimposed on a contour plot of the squared error surface. Assume $y = \begin{bmatrix} \sqrt{2} \\ 0 \\ 1 \\ 0 \end{bmatrix}$, and that the 4-by-2 matrix $X = U \Sigma V^T$ has singular value decomposition

$$U = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1/2 \end{bmatrix}, \quad V = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}.$$

Complete 20 iterations of gradient descent in each case specified below. Include the plots you generate with your submission.

a) What is the maximum value for the step size τ that will guarantee convergence?

b) Start proximal gradient descent from the point $w = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$ using a step size of τ = 0.5 and tuning parameter λ = 0.5. How do you explain the trajectory the weights take toward the optimum, e.g., why is it shaped this way? What direction does each iteration move in the regularization step?

c) Repeat the previous case with λ = 0.1. What happens? How does λ affect each iteration and why?
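Since the provided script is not reproduced here, the sketch below shows one common way to implement the iteration for this problem using the part b) settings; the factor-of-2 gradient convention and the exact form of the shrinkage step are assumptions that may differ from the course's script:

```python
import numpy as np

# Problem data from the SVD given above
U = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
S = np.diag([1.0, 0.5])
V = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
X = U @ S @ V.T
y = np.array([np.sqrt(2.0), 0.0, 1.0, 0.0])

tau, lam = 0.5, 0.5            # step size and tuning parameter from part b)
w = np.array([-1.0, 1.0])      # starting point from part b)
path = [w.copy()]
for _ in range(20):
    v = w - tau * 2.0 * X.T @ (X @ w - y)  # gradient step on ||y - Xw||_2^2
    w = v / (1.0 + 2.0 * tau * lam)        # shrinkage (prox) step for λ||w||_2^2
    path.append(w.copy())
path = np.array(path)  # iterate path, to be overlaid on the error contours
```

The provided script would then plot `path` over the contour plot of the squared error surface.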
