
DSGA1003 Homework 2: Lasso Regression

1 Introduction

In this homework you will investigate regression with $\ell_1$ regularization, both implementation techniques and theoretical properties. On the methods side, you'll work on coordinate descent (the
"shooting algorithm"), homotopy methods, and [optionally] projected SGD.

On the theory side,
you'll derive the explicit solution to the coordinate minimizers used in coordinate descent, you'll
derive the largest $\ell_1$ regularization parameter you'll ever need to try, you'll investigate what happens
with ridge and lasso regression when you have two copies of the same feature, and you'll [optionally]
work out the details of the classic picture that "explains" why $\ell_1$ regularization leads to sparsity.

For the experiments, we constructed a small dataset with known properties. The file generate_data.py
contains the code used to generate the data. Section 1.1 describes the steps followed to construct
the dataset.

1.1 Dataset construction

Start by creating a design matrix for regression with m = 150 examples, each of dimension d = 75.
We will choose a true weight vector $\theta$ that has only a few non-zero components:

1. Let $X \in \mathbb{R}^{m \times d}$ be the "design matrix," where the i'th row of $X$ is $x_i \in \mathbb{R}^d$. Construct a
random design matrix $X$ using the numpy.random.rand() function.

2. Construct a true weight vector $\theta \in \mathbb{R}^{d \times 1}$ as follows: set the first 10 components of $\theta$ to 10
or −10 arbitrarily, and all the other components to zero.

3. Let $y = (y_1, \ldots, y_m)^T \in \mathbb{R}^{m \times 1}$ be the response. Construct the vector $y = X\theta + \epsilon$, where $\epsilon$ is
an $m \times 1$ random noise vector whose entries are sampled i.i.d. from a normal distribution
with mean 0 and standard deviation 0.1. You can use numpy.random.randn() to generate
$\epsilon$ in Python.

 

4. Split the dataset by taking the first 80 points for training, the next 20 points for validation,
and the last 50 points for testing.
Note that we are not adding an extra feature for the bias in this problem. By construction, the
true model does not have a bias term.
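
For concreteness, here is a minimal numpy sketch of the construction above. The seed and variable names are our own choices, not part of the assignment; the provided generate_data.py is the authoritative version.

    import numpy as np

    np.random.seed(0)   # optional fixed seed; not required by the assignment

    m, d = 150, 75
    X = np.random.rand(m, d)                          # design matrix, entries uniform on [0, 1)

    theta = np.zeros(d)
    theta[:10] = np.where(np.random.rand(10) < 0.5, 10.0, -10.0)  # first 10 entries +/-10, rest 0

    epsilon = 0.1 * np.random.randn(m)                # i.i.d. N(0, 0.1^2) noise
    y = X.dot(theta) + epsilon

    # first 80 points for training, next 20 for validation, last 50 for testing
    X_train, y_train = X[:80],    y[:80]
    X_val,   y_val   = X[80:100], y[80:100]
    X_test,  y_test  = X[100:],   y[100:]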

2 Ridge Regression

By construction, we know that our dataset admits a sparse solution. Here, we want to evaluate the
performance of ridge regression (i.e. $\ell_2$-regularized linear regression) on this dataset.

1. Run ridge regression on this dataset. Choose the λ that minimizes the square loss on the validation set. For the chosen λ, examine the model coefficients. Report how many components
with true value 0 have been estimated to be non-zero, and vice-versa (don't worry if they are
all nonzero).

Now choose a small threshold (say $10^{-3}$ or smaller), count anything with magnitude smaller than the threshold as zero, and repeat the report. (For running ridge regression,
you may either use your code from HW1, or you may use scipy.optimize.minimize (see
the demo code provided for guidance); a minimal sketch is also given below. For debugging purposes, you are welcome, even encouraged, to compare your results to what you get from sklearn.linear_model.Ridge.)
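
A minimal sketch of the scipy.optimize.minimize approach, assuming the train/validation split from Section 1.1. The exact objective scaling and the grid of λ values are illustrative choices, not prescribed by the assignment.

    import numpy as np
    from scipy.optimize import minimize

    def ridge_objective(w, X, y, lam):
        """Square loss plus l2 penalty: ||Xw - y||^2 + lam * ||w||^2."""
        residual = X.dot(w) - y
        return residual.dot(residual) + lam * w.dot(w)

    def fit_ridge(X, y, lam):
        w0 = np.zeros(X.shape[1])
        result = minimize(ridge_objective, w0, args=(X, y, lam))
        return result.x

    best = None
    for lam in 10.0 ** np.arange(-5, 3):          # illustrative grid of lambda values
        w = fit_ridge(X_train, y_train, lam)
        val_loss = np.mean((X_val.dot(w) - y_val) ** 2)
        if best is None or val_loss < best[0]:
            best = (val_loss, lam, w)

    val_loss, lam_star, w_star = best
    n_small = np.sum(np.abs(w_star) < 1e-3)       # components treated as zero under the threshold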

3 Coordinate Descent for Lasso (a.k.a. The Shooting algorithm)

The Lasso optimization problem can be formulated as

    \hat{w} = \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^{m} (h_w(x_i) - y_i)^2 + \lambda \|w\|_1,

where $h_w(x) = w^T x$ and $\|w\|_1 = \sum_{i=1}^{d} |w_i|$. Since the $\ell_1$-regularization term in the objective
function is non-differentiable, it's not clear how gradient descent or SGD could be used to solve
this optimization problem. (In fact, as we'll see in the next homework on SVMs, we can use
"subgradient" methods when the objective function is not differentiable.)

Another approach to solving optimization problems is coordinate descent, in which at each step
we optimize over one component of the unknown parameter vector, fixing all other components.
The descent path so obtained is a sequence of steps, each of which is parallel to a coordinate axis
in $\mathbb{R}^d$, hence the name. It turns out that for the Lasso optimization problem, we can find a closed-form
solution for optimization over a single component, fixing all other components. This gives us
the following algorithm, known as the shooting algorithm:

(Source: Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.)
The "soft thresholding" function is defined as

    \operatorname{soft}(a, \delta) = \operatorname{sign}(a)\,(|a| - \delta)_+ .

NOTE: Algorithm 13.1 does not account for the case $a_j = c_j = 0$, which occurs when
the jth column of X is identically 0. You can either eliminate the column (as it cannot possibly
help the solution), or set $w_j = 0$ in that case, since it is, as you can easily verify, the
coordinate minimizer.

Note also that Murphy suggests initializing the optimization with the
ridge regression solution. Although theoretically this is not necessary (with exact computations and
enough time, coordinate descent will converge for lasso from any starting point), in practice it's
helpful to start as close to the solution as we're able.
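
For reference, here is a plain (unvectorized) Python sketch of the shooting algorithm described above; the coordinate update applies soft thresholding to $c_j/a_j$ with threshold $\lambda/a_j$, which is the form used in Murphy's Algorithm 13.1. The function and variable names are our own.

    import numpy as np

    def soft_threshold(a, delta):
        """soft(a, delta) = sign(a) * max(|a| - delta, 0)."""
        return np.sign(a) * max(abs(a) - delta, 0.0)

    def shooting_lasso(X, y, lam, w_init, num_iters=1000, tol=1e-8):
        """Coordinate descent ("shooting") for ||Xw - y||^2 + lam * ||w||_1."""
        w = w_init.copy()
        n, d = X.shape
        for _ in range(num_iters):
            max_change = 0.0
            for j in range(d):
                a_j = 2.0 * sum(X[i, j] ** 2 for i in range(n))
                c_j = 2.0 * sum(X[i, j] * (y[i] - X[i].dot(w) + w[j] * X[i, j]) for i in range(n))
                if a_j == 0.0:           # column j is identically zero (see NOTE above)
                    w_new = 0.0
                else:
                    w_new = soft_threshold(c_j / a_j, lam / a_j)
                max_change = max(max_change, abs(w_new - w[j]))
                w[j] = w_new
            if max_change < tol:         # stop when a full sweep barely changes w
                break
        return w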

There are a few tricks that can make selecting the hyperparameter λ easier and faster. First, you
can show that for any $\lambda \geq 2\|X^T(y - \bar{y})\|_\infty$, the estimated weight vector $\hat{w}$ is entirely zero, where $\bar{y}$
is the mean of the values in the vector y, and $\|\cdot\|_\infty$ is the infinity norm (or supremum norm), which is
the maximum over the absolute values of the components of a vector. Thus we need to search for an
optimal λ in $[0, \lambda_{\max}]$, where $\lambda_{\max} = 2\|X^T(y - \bar{y})\|_\infty$. (Note: this expression for $\lambda_{\max}$ assumes we
have an unregularized bias term in our model, i.e. our decision functions are $h_{w,b}(x) = w^T x + b$.
For the experiments, you can exclude the bias term, in which case $\lambda_{\max} = 2\|X^T y\|_\infty$.)
The second trick is to use the fact that when λ and λ′ are close, the corresponding solutions
$\hat{w}(\lambda)$ and $\hat{w}(\lambda')$ are also close. Start with $\lambda = \lambda_{\max}$, for which we know $\hat{w}(\lambda_{\max}) = 0$. You can
run the optimization anyway, and initialize the optimization at w = 0. Next, λ is reduced (e.g.
by a constant factor), and the optimization problem is solved using the previous optimal point as
the starting point. This is called warm starting the optimization. The technique of computing
a set of solutions for a chain of nearby λ's is called a continuation or homotopy method. The
resulting set of parameter values $\hat{w}(\lambda)$ as λ ranges over $[0, \lambda_{\max}]$ is known as a regularization
path.
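
A sketch of the warm-started (homotopy) path, reusing the shooting_lasso function sketched earlier and the no-bias expression for $\lambda_{\max}$; the decay factor and stopping point are arbitrary choices.

    import numpy as np

    # With no bias term, lambda_max = 2 * ||X^T y||_inf: at or above this value the solution is all zeros.
    lam_max = 2.0 * np.max(np.abs(X_train.T.dot(y_train)))

    lambdas, solutions = [], []
    w = np.zeros(X_train.shape[1])          # w_hat(lambda_max) = 0, so start the path there
    lam = lam_max
    while lam > 1e-3 * lam_max:             # arbitrary stopping point for the path
        w = shooting_lasso(X_train, y_train, lam, w_init=w)   # warm start from the previous solution
        lambdas.append(lam)
        solutions.append(w.copy())
        lam *= 0.8                          # reduce lambda by a constant factor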

3.1 Experiments with the Shooting Algorithm

1. Write a function that computes the Lasso solution for a given λ using the shooting algorithm
described above. This function should take a starting point for the optimization as a parameter. Run it on the dataset constructed in (1.1), and select the λ that minimizes the square
error on the validation set. Report the optimal value of λ found, and the corresponding test
error.

Plot the validation error vs λ. [Don’t use the homotopy method in this part, as we want to measure the speed improvement of homotopy methods in question 3. Also, no need
to vectorize the calculations until question 4, where again we’ll compare the speedup. In any
case, having two different implementations of the same thing is a good way to check your
work.]

2. Analyze the sparsity¹ of your solution, reporting how many components with true value zero
have been estimated to be non-zero, and vice-versa.

3. Implement the homotopy method described above. Compare the runtime for computing the
full regularization path (for the same set of λ's you tried in the first question above) using
the homotopy method versus the basic shooting algorithm.

4. The algorithm as described above is not ready for a large dataset (at least if it has been
implemented in basic Python) because of the implied loop over the dataset (i.e. where we
sum over the training set). By using matrix and vector operations, we can eliminate the loops.
This is called "vectorization" and can lead to dramatic speedups in languages such as Python,
Matlab, and R. Derive matrix expressions for computing $a_j$ and $c_j$. (Hint: a Matlab version
of this vectorized method can be found in the assignment zip file.) Implement the matrix
expressions and measure the speedup in computing the regularization path. A vectorized numpy
sketch is given below for reference.
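
For reference once you have derived the expressions yourself, here is one vectorized numpy form of $a_j$ and $c_j$ that is equivalent to the per-example sums used in the loop-based shooting algorithm; check it against your own derivation and implementation.

    import numpy as np

    def a_c_vectorized(X, y, w, j):
        """Vectorized a_j and c_j, equivalent to the per-example sums in the shooting algorithm."""
        X_j = X[:, j]
        a_j = 2.0 * X_j.dot(X_j)
        c_j = 2.0 * X_j.dot(y - X.dot(w) + w[j] * X_j)   # adding back w_j * x_ij excludes coordinate j
        return a_j, c_j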

3.2 Deriving the Coordinate Minimizer for Lasso

This problem is to derive the expressions for the coordinate minimizers used in the Shooting algorithm. This is often derived using subgradients (slide 15), but here we will take a bare hands
approach (which is essentially equivalent).
In each step of the shooting algorithm, we would like to find the $w_j$ minimizing

    f(w_j) = \sum_{i=1}^{n} \left( w^T x_i - y_i \right)^2 + \lambda \|w\|_1
           = \sum_{i=1}^{n} \left( w_j x_{ij} + \sum_{k \neq j} w_k x_{ik} - y_i \right)^2 + \lambda |w_j| + \lambda \sum_{k \neq j} |w_k| ,

where we've written $x_{ij}$ for the jth entry of the vector $x_i$. This function is strictly convex in $w_j$,
and thus it has a unique minimum. The only thing keeping f from being differentiable is the term
with $|w_j|$. So f is differentiable everywhere except $w_j = 0$.

We'll break this problem into 3 cases:
$w_j > 0$, $w_j < 0$, and $w_j = 0$. In the first two cases, we can simply differentiate f w.r.t. $w_j$ to get
optimality conditions. For the last case, we'll use the fact that since $f : \mathbb{R} \to \mathbb{R}$ is convex, 0 is a
minimizer of f iff

    \lim_{\varepsilon \downarrow 0} \frac{f(\varepsilon) - f(0)}{\varepsilon} \geq 0
    \quad \text{and} \quad
    \lim_{\varepsilon \downarrow 0} \frac{f(-\varepsilon) - f(0)}{\varepsilon} \geq 0.
¹One might hope that the solution will have a sparsity pattern that is similar to the ground truth. Estimators that
preserve the sparsity pattern (with enough training data) are said to be "sparsistent" (sparse + consistent).
Formally, an estimator $\hat{\beta}$ of parameter β is said to be consistent if the estimator $\hat{\beta}$ converges to the true value β in
probability as our sample size goes to infinity. Analogously, if we define the support of a vector β as the indices with
non-zero components, i.e. $\mathrm{Supp}(\beta) = \{j \mid \beta_j \neq 0\}$, then an estimator $\hat{\beta}$ is said to be sparsistent if, as the number of
samples becomes large, the support of $\hat{\beta}$ converges to the support of β, i.e. $\lim_{m \to \infty} P[\mathrm{Supp}(\hat{\beta}_m) = \mathrm{Supp}(\beta)] = 1$.

 

This is a special case of the optimality conditions described in slide 12 here, where now the "direction" v is simply taken to be the scalars 1 and −1, respectively.

1. First let's get a trivial case out of the way. If $x_{ij} = 0$ for $i = 1, \ldots, n$, what is the coordinate
minimizer $w_j$? In the remaining questions below, you may assume that $\sum_{i=1}^{n} x_{ij}^2 > 0$.

2. Give an expression for the derivative of $f(w_j)$ for $w_j \neq 0$. It will be convenient to write your
expression in terms of the following definitions:

    \operatorname{sign}(w_j) := \begin{cases} 1 & w_j > 0 \\ 0 & w_j = 0 \\ -1 & w_j < 0 \end{cases}

    a_j := 2 \sum_{i=1}^{n} x_{ij}^2

    c_j := 2 \sum_{i=1}^{n} x_{ij} \left( y_i - \sum_{k \neq j} w_k x_{ik} \right).
3. If $w_j > 0$ and minimizes f, show that $w_j = \frac{1}{a_j}(c_j - \lambda)$. Similarly, if $w_j < 0$ and minimizes f,
show that $w_j = \frac{1}{a_j}(c_j + \lambda)$. Give conditions on $c_j$ that imply that a minimizer $w_j$ is positive
and conditions for which a minimizer $w_j$ is negative.

4. Derive expressions for the two one-sided derivatives at f(0), and show that $c_j \in [-\lambda, \lambda]$ implies
that $w_j = 0$ is a minimizer.

5. Putting together the preceding results, we conclude the following:

    w_j = \begin{cases} \frac{1}{a_j}(c_j - \lambda) & c_j > \lambda \\ 0 & c_j \in [-\lambda, \lambda] \\ \frac{1}{a_j}(c_j + \lambda) & c_j < -\lambda \end{cases}

Show that this is equivalent to the expression given in 3.

4 Lasso Properties
4.1 Deriving λmax

In this problem we will derive an expression for $\lambda_{\max}$. For the first three parts, use the Lasso
objective function excluding the bias term, i.e. $L(w) = \|Xw - y\|_2^2 + \lambda \|w\|_1$. Show that for any
$\lambda \geq 2\|X^T y\|_\infty$, the estimated weight vector $\hat{w}$ is entirely zero, where $\|\cdot\|_\infty$ is the infinity norm (or
supremum norm), which is the maximum absolute value of any component of the vector.
1. The one-sided directional derivative of f(x) at x in the direction v is defined as:

    f'(x; v) = \lim_{h \downarrow 0} \frac{f(x + hv) - f(x)}{h}

Compute $L'(0; v)$. That is, compute the one-sided directional derivative of L(w) at w = 0 in
the direction v. [Hint: the result should be in terms of X, y, λ, and v.]

2. Since the Lasso objective is convex, for $w^*$ to be a minimizer of L(w) we must have that
the directional derivative $L'(w^*; v) \geq 0$ for all v. Starting from the condition $L'(0; v) \geq 0$,
rearrange terms to get a lower bound on λ. [Hint: this should be in terms of X, y, and v.]

3. In the previous problem, we get a different lower bound on λ for each choice of v. Compute the
maximum lower bound of λ by maximizing the expression over v. Show that this expression
is equivalent to $\lambda_{\max} = 2\|X^T y\|_\infty$.

4. [Optional] Show that for $L(w, b) = \|Xw + b\mathbf{1} - y\|_2^2 + \lambda \|w\|_1$, we have $\lambda_{\max} = 2\|X^T(y - \bar{y})\|_\infty$, where
$\bar{y}$ is the mean of the values in the vector y, and $\mathbf{1} \in \mathbb{R}^n$ is a column vector of 1's.

4.2 Feature Correlation

In this problem, we will examine and compare the behavior of the Lasso and ridge regression in
the case of an exactly repeated feature. That is, consider the design matrix $X \in \mathbb{R}^{m \times d}$, where
$X_{\cdot i} = X_{\cdot j}$ for some i and j, where $X_{\cdot i}$ is the $i$th column of X. We will see that ridge regression
divides the weight equally among identical features, while Lasso divides the weight arbitrarily.

In an optional part of this problem, we will consider what changes when $X_{\cdot i}$ and $X_{\cdot j}$ are highly correlated
(e.g. exactly the same except for some small random noise) rather than exactly the same.
1. Without loss of generality, assume the first two columns of X are our repeated features.
Partition X and θ as follows:

    X = \begin{pmatrix} x_1 & x_2 & X_r \end{pmatrix}, \qquad \theta = \begin{pmatrix} \theta_1 \\ \theta_2 \\ \theta_r \end{pmatrix}

We can write the Lasso objective function as:

    L(\theta) = \|X\theta - y\|_2^2 + \lambda \|\theta\|_1
              = \|x_1 \theta_1 + x_2 \theta_2 + X_r \theta_r - y\|_2^2 + \lambda |\theta_1| + \lambda |\theta_2| + \lambda \|\theta_r\|_1

With repeated features, there will be multiple minimizers of $L(\theta)$. Suppose that

    \hat{\theta} = \begin{pmatrix} a \\ b \\ r \end{pmatrix}

is a minimizer of $L(\theta)$. Give conditions on c and d such that $(c, d, r^T)^T$ is also a minimizer of
$L(\theta)$.

[Hint: First show that a and b must have the same sign, or at least one of them is zero.
Then, using this result, rewrite the optimization problem to derive a relation between a and b.]
2. Using the same notation as the previous problem, suppose

    \hat{\theta} = \begin{pmatrix} a \\ b \\ r \end{pmatrix}

minimizes the ridge regression objective function. What is the relationship between a and b,
and why?

3. [Optional] What do you think would happen with Lasso and ridge when $X_{\cdot i}$ and $X_{\cdot j}$ are highly
correlated, but not exactly the same? You may investigate this experimentally or theoretically; one possible experimental setup is sketched below.

5 [Optional] The Ellipsoids in the $\ell_1/\ell_2$ regularization picture

Recall the famous picture purporting to explain why $\ell_1$ regularization leads to sparsity, while $\ell_2$
regularization does not. Here's the instance from Hastie et al.'s The Elements of Statistical Learning:
(While Hastie et al. use β for the parameters, we'll continue to use w.)

In this problem we'll show that the level sets of the empirical risk are indeed ellipsoids centered
at the empirical risk minimizer $\hat{w}$.
Consider linear prediction functions of the form $x \mapsto w^T x$. Then the empirical risk for $f(x) = w^T x$ under the square loss is

    \hat{R}_n(w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x_i - y_i \right)^2 = \frac{1}{n} (Xw - y)^T (Xw - y).

 

1. [Optional] Let $\hat{w} = (X^T X)^{-1} X^T y$. Show that $\hat{w}$ has empirical risk given by

    \hat{R}_n(\hat{w}) = \frac{1}{n} \left( -y^T X \hat{w} + y^T y \right)

2. [Optional] Show that for any w we have

    \hat{R}_n(w) = \frac{1}{n} (w - \hat{w})^T X^T X (w - \hat{w}) + \hat{R}_n(\hat{w}).

Note that the RHS (i.e. "right hand side") has one term that's quadratic in w and one term
that's independent of w. In particular, the RHS does not have any term that's linear in w. On
the LHS (i.e. "left hand side"), we have $\hat{R}_n(w) = \frac{1}{n}(Xw - y)^T(Xw - y)$.

After expanding
this out, you'll have terms that are quadratic, linear, and constant in w. Completing the
square is the tool for rearranging an expression to get rid of the linear terms. The following
"completing the square" identity is easy to verify just by multiplying out the expressions on
the RHS:

    x^T M x - 2 b^T x = \left( x - M^{-1} b \right)^T M \left( x - M^{-1} b \right) - b^T M^{-1} b
3. [Optional] Using the expression derived for $\hat{R}_n(w)$ in 2, give a very short proof that $\hat{w} = (X^T X)^{-1} X^T y$ is the empirical risk minimizer. That is:

    \hat{w} = \arg\min_{w} \hat{R}_n(w).

Hint: Note that $X^T X$ is positive semidefinite and, by definition, a symmetric matrix M is
positive semidefinite iff for all $x \in \mathbb{R}^d$, $x^T M x \geq 0$.

4. [Optional] Give an expression for the set of w for which the empirical risk exceeds the minimum empirical risk $\hat{R}_n(\hat{w})$ by an amount c > 0. If X is full rank, then $X^T X$ is positive
definite, and this set is an ellipse – what is its center? For a concrete picture, see the small visualization sketched below.
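
If you would like to see this concretely, the following small matplotlib sketch (on entirely made-up 2-dimensional data, with our own variable names) plots contours of $\hat{R}_n(w)$, which appear as ellipses centered at the least-squares solution $\hat{w}$.

    import numpy as np
    import matplotlib.pyplot as plt

    np.random.seed(0)
    X2 = np.random.randn(50, 2)
    y2 = X2.dot(np.array([1.0, -2.0])) + 0.1 * np.random.randn(50)
    w_hat = np.linalg.solve(X2.T.dot(X2), X2.T.dot(y2))    # empirical risk minimizer

    w1, w2 = np.meshgrid(np.linspace(-1, 3, 200), np.linspace(-4, 0, 200))
    W = np.stack([w1.ravel(), w2.ravel()], axis=1)
    risk = np.mean((W.dot(X2.T) - y2) ** 2, axis=1).reshape(w1.shape)   # empirical risk on a grid of w

    plt.contour(w1, w2, risk, levels=20)
    plt.plot(*w_hat, "r*", markersize=12)   # contours are ellipses centered at w_hat
    plt.xlabel("w1"); plt.ylabel("w2")
    plt.show()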

6 [Optional] Projected SGD via Variable Splitting

In this question, we consider another general technique that can be used on the Lasso problem. We
first use the variable splitting method to transform the Lasso problem into a smooth problem with
linear inequality constraints, and then we can apply a variant of SGD.

Representing the unknown vector θ as a difference of two non-negative vectors $\theta^+$ and $\theta^-$, the
$\ell_1$-norm of θ is given by $\sum_{i=1}^{d} \theta_i^+ + \sum_{i=1}^{d} \theta_i^-$. Thus, the optimization problem can be written as

    (\hat{\theta}^+, \hat{\theta}^-) = \arg\min_{\theta^+, \theta^- \in \mathbb{R}^d} \sum_{i=1}^{m} \left( h_{\theta^+,\theta^-}(x_i) - y_i \right)^2 + \lambda \sum_{i=1}^{d} \theta_i^+ + \lambda \sum_{i=1}^{d} \theta_i^-
    \quad \text{such that } \theta^+ \geq 0 \text{ and } \theta^- \geq 0,

where $h_{\theta^+,\theta^-}(x) = (\theta^+ - \theta^-)^T x$. The original parameter θ can then be estimated as $\hat{\theta} = \hat{\theta}^+ - \hat{\theta}^-$.

This is a convex optimization problem with a differentiable objective and linear inequality
constraints. We can approach this problem using projected stochastic gradient descent, as discussed
in lecture. Here, after taking our stochastic gradient step, we project the result back into the feasible
set by setting any negative components of $\theta^+$ and $\theta^-$ to zero.
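
A minimal sketch of this projected SGD procedure. The step size, epoch count, initialization, and the choice to spread the $\ell_1$ penalty evenly over the m per-example terms are our own assumptions, not part of the problem statement.

    import numpy as np

    def projected_sgd_lasso(X, y, lam, num_epochs=100, step=1e-3):
        """Projected SGD on the split formulation: theta = theta_plus - theta_minus, both >= 0."""
        n, d = X.shape
        theta_plus = np.zeros(d)
        theta_minus = np.zeros(d)
        for epoch in range(num_epochs):
            for i in np.random.permutation(n):
                x_i, y_i = X[i], y[i]
                r = (theta_plus - theta_minus).dot(x_i) - y_i      # residual for this example
                grad_plus = 2.0 * r * x_i + lam / n                # l1 penalty spread over the n per-example terms
                grad_minus = -2.0 * r * x_i + lam / n
                theta_plus -= step * grad_plus
                theta_minus -= step * grad_minus
                np.maximum(theta_plus, 0.0, out=theta_plus)        # project back onto the feasible set
                np.maximum(theta_minus, 0.0, out=theta_minus)
        return theta_plus - theta_minus                            # estimate of theta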

1. [Optional] Implement projected SGD to solve the above optimization problem for the same
λ’s as used with the shooting algorithm. Since the two optimization algorithms should find
essentially the same solutions, you can check the algorithms against each other.

Report the
differences in validation loss for each λ between the two optimization methods. (You can
make a table or plot the differences.)

2. [Optional] Choose the λ that gives the best performance on the validation set. Describe the
solution $\hat{w}$ in terms of its sparsity. How does the sparsity compare to the solution from the
shooting algorithm?