
ECE M146 Homework 2 Solved


1. In class, we saw that the solution of the least squares problem satisfies the normal equation $X^T X w = X^T y$, where $X \in \mathbb{R}^{N \times M}$ and $y \in \mathbb{R}^N$. Prove the following statement:

The Gram matrix $X^T X$ is nonsingular if and only if $X$ has linearly independent columns.

With $N \gg M$, there will likely be $M$ linearly independent vectors $x_n$; therefore, the Gram matrix $X^T X$ is likely to be invertible.

Hint:

$X$ has linearly independent columns if $Xv = 0$ only for $v = 0$. You may want to use the following properties of (non)singular matrices. The equation $Av = 0$ has only the trivial solution $v = 0$ if and only if $A$ is nonsingular. If $A$ is singular, there exists $z \neq 0$ such that $Az = 0$.
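As a quick numerical sanity check (not a proof), one can compare the ranks of $X$ and $X^T X$ in MATLAB for a matrix with independent columns and for one with a repeated column; the matrices below are made up purely for illustration.

% Numerical sanity check of the statement (illustration only, not a proof).
X1 = [1 2; 3 4; 5 6];    % linearly independent columns (rank 2)
X2 = [1 1; 2 2; 3 3];    % repeated column (rank 1)
fprintf('rank(X1) = %d, rank(X1''*X1) = %d\n', rank(X1), rank(X1'*X1));
fprintf('rank(X2) = %d, rank(X2''*X2) = %d\n', rank(X2), rank(X2'*X2));
% rank(X'*X) matches rank(X) in both cases: X1'*X1 is nonsingular (rank 2),
% while X2'*X2 is singular (rank 1), consistent with the statement above.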
2. Consider the hat matrix $H = X(X^T X)^{-1} X^T$, where $X \in \mathbb{R}^{N \times M}$ and $X^T X$ is invertible.

(a) Show that $H$ is symmetric.
(b) Show that $H^K = H$ for any positive integer $K$.
(c) If $I$ is the identity matrix of size $N$, show that $(I - H)^K = I - H$ for any positive integer $K$.
(d) Show that $\mathrm{Trace}(H) = M$, where the trace is the sum of the diagonal elements.

Hint: $\mathrm{Trace}(AB) = \mathrm{Trace}(BA)$.
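A quick numerical check of properties (a)-(d) in MATLAB, on an arbitrary random $X$ chosen here just for illustration, might look like:

% Numerical check of the hat matrix properties (illustration only).
rng(0);                             % fixed seed so the example is reproducible
N = 6;  M = 3;
X = randn(N, M);                    % random X, almost surely full column rank
H = X / (X' * X) * X';              % hat matrix H = X*(X'X)^{-1}*X'
I = eye(N);
fprintf('(a) symmetry error       : %g\n', norm(H - H', 'fro'));
fprintf('(b) H^2 - H error        : %g\n', norm(H*H - H, 'fro'));       % idempotence gives H^K = H
fprintf('(c) (I-H)^2 - (I-H) error: %g\n', norm((I-H)*(I-H) - (I-H), 'fro'));
fprintf('(d) trace(H) = %g,  M = %d\n', trace(H), M);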

3. Consider a linear regression problem in which we want to "weigh" different training instances differently because some of the instances are more important than others. Specifically, suppose we want to minimize
$$J(w_0, w_1) = \sum_{n=1}^{N} \alpha_n (w_0 + w_1 x_{n,1} - y_n)^2. \tag{1}$$
Here $\alpha_n > 0$. In class we worked out what happens for the case where all weights (the $\alpha_n$'s) are the same. In this problem, we will generalize some of those ideas to the weighted setting. Calculate the gradient by computing the partial derivative of $J$ with respect to each of the parameters $(w_0, w_1)$. Comment on how the $\alpha_n$'s affect the linear regression problem. For example, what happens if $\alpha_i = 0$ for some $i$? Qualitatively describe what will happen in gradient descent if $\alpha_j$ is much greater than the other $\alpha_{j'}$, $j' \neq j$.

4. In this exercise, you will develop a stochastic gradient descent (SGD) view of the perceptron algorithm.

(a) Find the gradient $\frac{\partial J(w)}{\partial w}$ for the following loss function:
$$J(w) = -\sum_{i \in \mathcal{M}} w^T x_i y_i,$$
where $w \in \mathbb{R}^N$, $x_i \in \mathbb{R}^N$, $i = 1, \cdots, M$, and $y_i \in \{-1, 1\}$. The set $\mathcal{M}$ denotes the index set of misclassified data points, i.e., $\mathcal{M} = \{i \mid \mathrm{sign}(w^T x_i) \neq y_i\}$. You may assume $w^T x_i \neq 0$, $\forall i$ in this question.

(b) Find the gradient $\frac{\partial J(w)}{\partial w}$ for the following loss function:
$$J(w) = \sum_{i=1}^{M} \max[0, -w^T x_i y_i],$$
where $w \in \mathbb{R}^N$, $x_i \in \mathbb{R}^N$, and $y_i \in \{-1, 1\}$. You may assume $w^T x_i \neq 0$, $\forall i$ in this question.

(c) Compare your answers from (a) and (b): are they equivalent? Describe the update rule if you use the SGD algorithm to minimize these loss functions. With the learning rate $\eta = 1$, are they equivalent to the perceptron algorithm?
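For reference, SGD on a single misclassified point for the loss in (a) with $\eta = 1$ gives the update $w \leftarrow w + y_i x_i$, which is the standard perceptron step; a minimal MATLAB sketch on toy data (the data and stopping rule are assumptions made for illustration) is:

% Perceptron viewed as SGD with learning rate eta = 1 (toy-data sketch).
X   = [1 2; 2 1; -1 -2; -2 -1];      % rows are the data points x_i (made up)
y   = [1; 1; -1; -1];                % labels in {-1, +1}
w   = zeros(2, 1);                   % initial weight vector
eta = 1;                             % learning rate

for epoch = 1:100
    misclassified = find(sign(X*w) ~= y);   % index set M of misclassified points
    if isempty(misclassified), break; end   % stop once everything is classified correctly
    i = misclassified(1);                   % pick one misclassified point
    w = w + eta * y(i) * X(i, :)';          % SGD step = perceptron update
end
disp(w');                                   % learned weights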

5. In this exercise, you will work through linear and polynomial regression. Our data consists of inputs $x_n \in \mathbb{R}$ and outputs $y_n \in \mathbb{R}$, $n \in \{1, \cdots, N\}$, which are related through a target function $f(x)$. Your goal is to learn a predictor $h_w(x)$ that best approximates $f(x)$.

(a) Visualization. We provide you two sets of data, the training data and the testing data, in the two files regression_train.csv and regression_test.csv. In each file, the first column is the input and the second column is the output. In MATLAB, visualize the training data by plotting the input vs. the output. What do you observe? For example, can you make an educated guess about the effectiveness of linear regression in predicting the data?
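A minimal MATLAB sketch for this step, assuming the training file is named regression_train.csv and has no header row, is:

% Load and visualize the training data (assumes regression_train.csv has no
% header row; column 1 = input, column 2 = output).
train   = csvread('regression_train.csv');
x_train = train(:, 1);
y_train = train(:, 2);

figure;
scatter(x_train, y_train, 'filled');
xlabel('input x');  ylabel('output y');
title('Training data');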

(b) Linear Regression: closed-form solution. Let us start by considering a simple linear regression model:
$$h_w(x) = w^T x = w_0 + w_1 x.$$
Recall that linear regression attempts to minimize the objective function
$$J(w) = \sum_{n=1}^{N} (h_w(x_n) - y_n)^2 = \|Xw - y\|^2,$$
where
$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \qquad X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix}, \qquad w = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}.$$
Note that to take into account the intercept term $w_0$, we can add an additional "feature" to each instance and set it to one, e.g., $x_{i,0} = 1$. This is equivalent to adding an additional first column to $X$ and setting it to all ones.
In class we learned that the closed-form solution to linear regression is:
$$w^* = (X^T X)^{-1} X^T y.$$

Implement this closed-form solution in MATLAB using the training data and
report w and J(w). Also generate a plot depicting your training data and the
fitted line.
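A sketch of this step in MATLAB, continuing from the variables loaded in part (a) (the variable names are my own), might be:

% Closed-form least-squares fit (continues from part (a)).
N = length(x_train);
X = [ones(N, 1), x_train];          % prepend the all-ones column for the intercept w0
w = (X' * X) \ (X' * y_train);      % w* = (X'X)^{-1} X'y, computed via backslash
J = norm(X * w - y_train)^2;        % objective value J(w)
fprintf('w = [%.4f, %.4f], J(w) = %.4f\n', w(1), w(2), J);

% Plot the training data together with the fitted line.
figure;  hold on;
scatter(x_train, y_train, 'filled');
xs = linspace(min(x_train), max(x_train), 100)';
plot(xs, [ones(numel(xs), 1), xs] * w, 'r-', 'LineWidth', 1.5);
xlabel('input x');  ylabel('output y');
legend('training data', 'fitted line');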
(c) Linear Regression: gradient descent. Another way to solve linear regression is through gradient descent (GD). In gradient descent, each iteration performs the following update:
$$w_j := w_j - \eta \sum_{n=1}^{N} (h_w(x_n) - y_n)\, x_{n,j} \qquad \text{(simultaneously update $w_j$ for all $j$).}$$
With each iteration of gradient descent, we expect our updated parameters $w$ to come closer to the optimal $w^*$ and therefore achieve a lower value of $J(w)$.
Implement the gradient descent algorithm in MATLAB using all of the following specifications:

• Initialize $w$ to be the all-zeros vector.
• Run the algorithm for 10000 iterations.
• Terminate the algorithm earlier if the value of $J(w)$ is unchanged across consecutive iterations. In this exercise, we say $J(w)$ is unchanged if $|J(w)_{t-1} - J(w)_t| < 0.0001$.
• Use a fixed learning rate $\eta$.
For each of $\eta = 0.05, 0.001, 0.0001, 0.00001$, report the final $J(w)$ and the number of iterations until convergence (the number will be 10000 if the algorithm did not converge within 10000 iterations). Discuss how the learning rate $\eta$ affects the GD algorithm.
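A gradient-descent loop matching these specifications could be sketched as below; the iteration cap, tolerance, and learning rates come from the problem statement, the update follows the rule given above (which omits the factor of 2 from the exact gradient), and X and y_train are reused from part (b).

% Batch gradient descent for linear regression (sketch; reuses X and y_train).
eta      = 0.05;                    % also try 0.001, 0.0001, 0.00001
max_iter = 10000;
tol      = 0.0001;                  % |J_{t-1} - J_t| threshold for early termination
w        = zeros(2, 1);             % initialize w to the all-zeros vector
J_prev   = Inf;

for t = 1:max_iter
    r = X * w - y_train;            % residuals h_w(x_n) - y_n
    w = w - eta * (X' * r);         % simultaneous update of every w_j
    J = norm(X * w - y_train)^2;
    if abs(J_prev - J) < tol, break; end
    J_prev = J;
end
fprintf('eta = %g: J(w) = %.4f after %d iterations\n', eta, J, t);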

(d) Gradient Descent Visualization. Repeat the exercise in (c) with only $\eta = 0.05$ for only 40 iterations, with $w$ initialized to the all-zeros vector. Report $w$ and $J(w)$, and plot the fitted line (along with the data) for iterations 0, 10, 20, 30, and 40. Compare your fitted lines and $J(w)$ values to what you got in (b). What do you observe over different iterations?
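One way to produce these snapshots is to store $w$ at iterations 0, 10, 20, 30, and 40 inside the loop from (c) and plot each stored line; a brief sketch reusing X, y_train, x_train, and xs from above is:

% Snapshots of the fitted line during gradient descent (eta = 0.05).
eta = 0.05;  w = zeros(2, 1);
snapshots = {w};                                 % iteration 0
for t = 1:40
    w = w - eta * (X' * (X * w - y_train));
    if mod(t, 10) == 0
        snapshots{end+1} = w;                    % iterations 10, 20, 30, 40
    end
end

figure;  hold on;
scatter(x_train, y_train, 'filled');
for k = 1:numel(snapshots)
    plot(xs, [ones(numel(xs), 1), xs] * snapshots{k});
end
xlabel('input x');  ylabel('output y');
title('Fitted lines at iterations 0, 10, 20, 30, 40');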

(e) Polynomial Regression. Next let us consider the more complicated case of polynomial regression, where our hypothesis is
$$h_w(x) = w^T \phi(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_m x^m.$$
Note that the function is linear in the coefficients $w$. We can therefore extend the result of linear regression by replacing the input matrix $X$ with
$$\Phi = \begin{bmatrix} \phi(x_1)^T \\ \phi(x_2)^T \\ \vdots \\ \phi(x_N)^T \end{bmatrix},$$
where $\phi(x)$ is a function such that $\phi_j(x) = x^j$ for $j = 0, \cdots, m$.
For $m = 0, \cdots, 10$, use the closed-form solution to determine the best-fit polynomial regression model on the training data. With this model, calculate the RMSE (root-mean-square error), i.e., $E_{\mathrm{RMS}} = \sqrt{J(w)/N}$, on both the training data and the test data. Generate a plot depicting how the RMSE (both training and testing) varies with model complexity (polynomial degree $m$). Which degree of polynomial would you say best fits the data? Was there evidence of under- or over-fitting the data? Use your plot to justify your answer.
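A sketch of this sweep in MATLAB, assuming the test file is named regression_test.csv with the same column layout as the training file, is below. It builds $\Phi$ with implicit expansion (R2016b or later) and reuses x_train and y_train from above.

% Polynomial regression for degrees m = 0..10 with train/test RMSE.
test   = csvread('regression_test.csv');     % assumed: same layout as the training file
x_test = test(:, 1);  y_test = test(:, 2);

degrees    = 0:10;
rmse_train = zeros(size(degrees));
rmse_test  = zeros(size(degrees));

for k = 1:numel(degrees)
    m = degrees(k);
    Phi_train = x_train .^ (0:m);            % row n is phi(x_n)' with phi_j(x) = x^j
    Phi_test  = x_test  .^ (0:m);
    w = (Phi_train' * Phi_train) \ (Phi_train' * y_train);   % closed-form fit
    rmse_train(k) = sqrt(norm(Phi_train * w - y_train)^2 / numel(y_train));
    rmse_test(k)  = sqrt(norm(Phi_test  * w - y_test )^2 / numel(y_test));
end

figure;  hold on;
plot(degrees, rmse_train, '-o');
plot(degrees, rmse_test,  '-s');
xlabel('polynomial degree m');  ylabel('E_{RMS}');
legend('training RMSE', 'test RMSE');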