## Description

## 1 [31 points] Derivation and Proof

(a) [8 points] Consider the linear regression problem for 1D data, where we would like to learn a function $h(x) = w_1 x + w_0$ with parameters $w_0$ and $w_1$ to minimize the sum of squared errors:

$$L = \frac{1}{2} \sum_{i=1}^{N} \left( y^{(i)} - h(x^{(i)}) \right)^2 \tag{1}$$

for $N$ pairs of data samples $(x^{(i)}, y^{(i)})$. Derive the solution for $w_0$ and $w_1$ for this 1D case of linear regression. Show the derivation to get the solution

$$w_0 = \bar{Y} - w_1 \bar{X}, \tag{2}$$

$$w_1 = \frac{\frac{1}{N} \sum_{i=1}^{N} x^{(i)} y^{(i)} - \bar{Y}\bar{X}}{\frac{1}{N} \sum_{i=1}^{N} (x^{(i)})^2 - \bar{X}^2} \tag{3}$$

where $\bar{X}$ is the mean of $\{x^{(1)}, x^{(2)}, \cdots, x^{(N)}\}$ and $\bar{Y}$ is the mean of $\{y^{(1)}, y^{(2)}, \cdots, y^{(N)}\}$.

(b) [14 points] Recall the definition and properties of positive (semi-)definite matrices. Let $A$ be a real, symmetric $d \times d$ matrix. $A$ is positive semi-definite (PSD) if, for all $z \in \mathbb{R}^d$, $z^\top A z \ge 0$. $A$ is positive definite (PD) if, for all $z \ne 0$, $z^\top A z > 0$. We write $A \succeq 0$ when $A$ is PSD, and $A \succ 0$ when $A$ is PD.

It is known that every real symmetric matrix $A$ can be factorized via the eigenvalue decomposition $A = U \Lambda U^\top$, where $U$ is a $d \times d$ matrix such that $U U^\top = U^\top U = I$ and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \cdots, \lambda_d)$. Multiplying on the right by $U$, we see that $AU = U\Lambda$. If we let $u_i$ denote the $i$-th column of $U$, we have $A u_i = \lambda_i u_i$ for each $i$: the $\lambda_i$ are the eigenvalues of $A$, and the corresponding columns of $U$ are the eigenvectors associated with each $\lambda_i$. The eigenvalues constitute the "spectrum" of $A$.

i. [6 points] Prove that $A$ is PD if and only if $\lambda_i > 0$ for each $i$. (Note: "if and only if" means you are asked to prove both directions.)

ii. [8 points] Consider the linear regression problem where $\Phi$ and $y$ are as defined in class; we saw that the closed-form solution is $(\Phi^\top \Phi)^{-1} \Phi^\top y$.

Now consider a ridge regression problem with the regularization term $\frac{1}{2} \beta \|w\|_2^2$. We know that the symmetric matrix in the closed-form solution is $\Phi^\top \Phi + \beta I$. Please (i) derive the eigenvalues and eigenvectors of $\Phi^\top \Phi + \beta I$ in terms of the eigenvalues and eigenvectors of $\Phi^\top \Phi$, denoted $\lambda_i$ and $u_i$, and then (ii) for any $\beta > 0$, prove that the matrix $\Phi^\top \Phi + \beta I$ is PD.
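A numerical illustration (not a substitute for the proof) of how the spectrum behaves under the $\beta I$ shift, using a small randomly generated $\Phi$:

```python
import numpy as np

# Illustration: the eigenvalues of Phi^T Phi + beta*I are lambda_i + beta,
# with the same eigenvectors, so all are strictly positive when beta > 0.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 5))
beta = 0.1

A = Phi.T @ Phi
lam, U = np.linalg.eigh(A)                      # eigenpairs of Phi^T Phi
lam_reg, _ = np.linalg.eigh(A + beta * np.eye(5))

assert np.allclose(lam_reg, lam + beta)         # spectrum shifts by beta
assert np.all(lam_reg > 0)                      # hence PD for beta > 0
```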

(c) [9 points] In this sub-problem, we use logistic regression to predict the class label $y \in \{-1, +1\}$ instead of $y \in \{0, 1\}$ as in ordinary logistic regression. Show that maximizing the log-likelihood of logistic regression, $\sum_{n=1}^{N} \log P(y^{(n)} \mid x^{(n)})$, is equivalent to minimizing the following loss function:

$$\sum_{n=1}^{N} \log\left( 1 + \exp\left( -y^{(n)} \cdot w^\top \phi(x^{(n)}) \right) \right). \tag{4}$$

[Hint: You can expand the log-likelihood as follows: $\log P(y^{(n)} \mid x^{(n)}) = \mathbb{I}(y^{(n)} = 1) \log P(y^{(n)} = 1 \mid x^{(n)}) + \mathbb{I}(y^{(n)} = -1) \log P(y^{(n)} = -1 \mid x^{(n)})$, then plug in the class posterior probability of the logistic regression model.]
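As a quick numerical check of the identity behind Equation (4), assuming the standard sigmoid posterior $P(y = +1 \mid x) = \sigma(w^\top \phi(x))$:

```python
import numpy as np

# Check: with P(y=+1|x) = sigmoid(w^T phi(x)), the negative log-likelihood
# of a label y in {-1,+1} equals log(1 + exp(-y * w^T phi(x))).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
z = rng.normal(size=100)               # stand-ins for w^T phi(x)
y = rng.choice([-1, 1], size=100)

nll = -np.log(np.where(y == 1, sigmoid(z), 1.0 - sigmoid(z)))
loss = np.log1p(np.exp(-y * z))        # the loss in Equation (4)
assert np.allclose(nll, loss)
```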


## 2 [39 points] Linear regression on a polynomial

In this problem, you will implement linear regression on a polynomial. Please have a look at the accompanying starter code linear_regression.py and notebook linear_regression.ipynb for instructions first. Please note that all the sub-questions without (Autograder) need to be answered in your write-up.

**Sample data** The files q2xTrain.npy, q2xTest.npy, q2yTrain.npy, and q2yTest.npy specify a linear regression problem for a polynomial. q2xTrain.npy represents the inputs ($x^{(i)} \in \mathbb{R}$) and q2yTrain.npy represents the outputs ($y^{(i)} \in \mathbb{R}$) of the training set, with one training example per row.

## 2.1 GD and SGD

You will compare the following two optimization methods, in finding the coefficients of a polynomial of

degree one (i.e. slope and intercept) that minimize the training loss.

• Batch gradient descent (GD)

• Stochastic gradient descent (SGD)

Here, as we saw in class, the training objective is defined as:

$$E(w) = \frac{1}{2} \sum_{i=1}^{N} \left( \sum_{j=0}^{M} w_j \phi_j(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2} \sum_{i=1}^{N} \left( w^\top \phi(x^{(i)}) - y^{(i)} \right)^2 \tag{5}$$
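For orientation, here is a minimal batch-GD sketch for this objective on degree-one features. The function name and hyperparameters (`lr`, `n_iters`) are illustrative only; your submission must follow the signatures given in the starter code:

```python
import numpy as np

# Minimal batch gradient descent on objective (5).
# Hypothetical interface -- not the starter code's signature.
def gd(Phi, y, lr=0.01, n_iters=1000):
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        grad = Phi.T @ (Phi @ w - y)   # gradient of (5) w.r.t. w
        w -= lr * grad
    return w

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0                      # noiseless line: intercept 1, slope 2
Phi = np.stack([np.ones_like(x), x], axis=1)   # phi(x) = (1, x)
w = gd(Phi, y)                         # converges toward (1, 2)
```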

(a) [12 points] (Autograder) Implement the GD and SGD optimization methods. For all the implementation details (e.g., function signature, initialization of weight vectors, etc.), follow the instructions given in the code files. Your score for this question will be graded by the correctness of the implementation.

(b) [3 points] Please share the plot generated from section 2(b) of your .ipynb file in your write-up, and then compare the two optimization methods, GD and SGD. Which one takes less time, and which one achieves a lower objective $E(w)$ on the test set?


## 2.2 Over-fitting study

Next, you will investigate the problem of over-fitting. Recall the figure from lecture that explored over-fitting as a function of the degree of the polynomial $M$. To evaluate this behaviour, we examine the Root-Mean-Square (RMS) error, defined below. (Note: we use the RMS error for evaluation purposes only, NOT as a training objective.)

$$E_{\mathrm{RMS}} = \sqrt{2 E(w^*) / N}, \tag{6}$$

Figure 1: (Sample plot) Overfitting study: Train-Test accuracy.

(a) [8 points] (Autograder) In this subquestion, we will use the closed-form solution of linear regression (assuming all of its conditions are met) instead of iterative optimization methods. Implement the closed-form solution for the linear regression problem.
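A sketch of the closed form $w^* = (\Phi^\top \Phi)^{-1} \Phi^\top y$ with polynomial features, assuming the feature map described below; helper names are illustrative and should be adapted to the starter code's interface:

```python
import numpy as np

# Hypothetical helpers -- match the starter code's interface in your submission.
def poly_features(x, M):
    # Columns are 1, x, x^2, ..., x^M (a Vandermonde matrix).
    return np.vander(x, M + 1, increasing=True)

def closed_form(Phi, y):
    # Solve the normal equations Phi^T Phi w = Phi^T y.
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x ** 2 - x + 3.0                    # exact quadratic
w = closed_form(poly_features(x, 2), y) # recovers coefficients (3, -1, 1)
```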

(b) [2 points] Regenerate the above plot with the provided data. The sample training data can be generated with $M$-degree polynomial features (for $M = 0, \ldots, 9$) from q2xTrain.npy and q2yTrain.npy: we assume the feature vector is $\phi(x^{(i)}) = (1, x^{(i)}, (x^{(i)})^2, \cdots, (x^{(i)})^M)$ for any value of $M$. For the test curve, use the data in q2xTest.npy and q2yTest.npy. Note that the trend of your curve is not necessarily the same as the sample plot. Attach your plot to the write-up.

(c) [2 points] In the write-up, please discuss: Which degree polynomial would you say best fits the data?

Was there evidence of under/over-fitting the data? Use your generated plots to justify your answer.


## 2.3 Regularization (Ridge Regression)

Finally, you will explore the role of regularization. Recall the image from lecture that explored the effect of

the regularization factor λ:

Figure 2: (Sample plot) effects of the regularization factor λ.

(a) [8 points] (Autograder) Implement the closed-form solution of ridge regression. Specifically, please use the following regularized objective function:

$$\frac{1}{2} \sum_{i=1}^{N} \left( w^\top \phi(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \|w\|_2^2 \tag{7}$$

for optimizing the parameters $w$.
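A sketch of the ridge closed form $w^* = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y$ that minimizes objective (7); names (`ridge_closed_form`, `lam`) are illustrative, not the starter code's:

```python
import numpy as np

# Closed form for the regularized objective (7).
def ridge_closed_form(Phi, y, lam):
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(30, 5))
y = rng.normal(size=30)

w_small = ridge_closed_form(Phi, y, 1e-5)
w_large = ridge_closed_form(Phi, y, 1e3)
# A larger lambda shrinks the weight vector toward zero.
assert np.linalg.norm(w_large) < np.linalg.norm(w_small)
```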

(b) [2 points] For the sample data, our goal is to regenerate the above plot (Figure 2), but with $\lambda \in \{10^{-5}, 10^{-4}, \ldots, 10^{-1}, 10^{0} (= 1)\}$. Please first compute $E_{\mathrm{RMS}}$ over the training data specified in q2xTrain.npy and q2yTrain.npy for each $\lambda$, and then measure the test error with q2xTest.npy and q2yTest.npy. Please attach your (unregularized) $E_{\mathrm{RMS}}$ plot of both the training data and test data, obtained from 2(g), to the write-up. Note that the trend of your curve is not necessarily the same as the sample plot.

(c) [2 points] Discuss: Which $\lambda$ value seemed to work best for a ninth-degree polynomial ($M = 9$)? Use your generated plots to justify your answer. Please provide your answer in your write-up.


## 3 [30 points] Locally weighted linear regression

Consider a linear regression problem in which we want to weight different training examples differently. Specifically, suppose we want to minimize

$$E_D(w) = \frac{1}{2} \sum_{i=1}^{N} r^{(i)} \left( w^\top x^{(i)} - y^{(i)} \right)^2,$$

where $r^{(i)} \in \mathbb{R}$ is the "local" weight for the sample $(x^{(i)}, y^{(i)})$. In class, we worked on its special case where all the weights (the $r^{(i)}$'s) are equal. In this problem, we will generalize some of those ideas to the weighted setting, and also implement the locally weighted linear regression algorithm. [Note: the weight $r^{(i)}$ can be different for each of the data points in the training data.]

(a) [3 points] Show that $E_D(w)$ can also be written as

$$E_D(w) = (w^\top X - y^\top) R (w^\top X - y^\top)^\top$$

for an appropriate diagonal matrix $R$, where $X \in \mathbb{R}^{D \times N}$ is a matrix whose $i$-th column is $x^{(i)} \in \mathbb{R}^{D \times 1}$ and $y \in \mathbb{R}^{N \times 1}$ is the vector whose $i$-th entry is $y^{(i)}$. Here, please note that in the locally weighted linear regression in this homework, we use raw data directly without mapping to high-dimensional features (i.e., each input is represented as a $D$-dimensional input vector, not an $M$-dimensional feature vector), which is why we use the notation $X$ instead of $\Phi$. State clearly what the $R$ matrix is.

(b) [7 points] As we already saw in class, if all the $r^{(i)}$'s equal 1, then the normal equation for $w \in \mathbb{R}^{D \times 1}$ becomes

$$X X^\top w = X y,$$

and the value of $w^*$ that minimizes $E_D(w)$ is given by $(X X^\top)^{-1} X y$.² Now, by finding the derivative $\nabla_w E_D(w)$ from (a) and setting it to zero, generalize the normal equation and the closed-form solution to this locally weighted setting, and give the new value of $w^*$ that minimizes $E_D(w)$ in closed form as a function of $X$, $R$, and $y$. (Hint: $\nabla_w (R X^\top w) = X R^\top = X R$.)
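A quick numerical check of the quoted special case (all $r^{(i)} = 1$), cross-checking the column-major normal equation against an ordinary least-squares solver; the general weighted case is left as the exercise:

```python
import numpy as np

# Special case: with all r^(i) = 1, the column-major normal equation
# X X^T w = X y gives the ordinary least-squares fit. Cross-check
# against np.linalg.lstsq on the row-major (transposed) design.
rng = np.random.default_rng(0)
D, N = 3, 40
X = rng.normal(size=(D, N))   # column-major: one example per column
y = rng.normal(size=N)

w = np.linalg.solve(X @ X.T, X @ y)
w_ref, *_ = np.linalg.lstsq(X.T, y, rcond=None)
assert np.allclose(w, w_ref)
```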

(c) [8 points] Suppose we have a training set $\{(x^{(i)}, y^{(i)}); i = 1, \ldots, N\}$ of $N$ independent examples, but in which the $y^{(i)}$'s were observed with differing variances. Specifically, suppose that

$$p(y^{(i)} \mid x^{(i)}; w) = \frac{1}{\sqrt{2\pi}\, \sigma^{(i)}} \exp\left( -\frac{(y^{(i)} - w^\top x^{(i)})^2}{2 (\sigma^{(i)})^2} \right)$$

I.e., $y^{(i)}$ is a Gaussian random variable with mean $w^\top x^{(i)}$ and variance $(\sigma^{(i)})^2$ (where the $\sigma^{(i)}$'s are fixed, known constants). Show that finding the maximum likelihood estimate (MLE) of $w$ reduces to solving a weighted linear regression problem $E_D(w)$. State clearly what the $r^{(i)}$'s are in terms of the $\sigma^{(i)}$'s.

²We presented the row-major ($X \in \mathbb{R}^{N \times D}$) version of $w^*$ as $(X^\top X)^{-1} X^\top y$ in Lecture 2. For practice, we are using the column-major ($X \in \mathbb{R}^{D \times N}$) version in this problem.


(d) [12 points, Programming Assignment] In the following, you will use the files q3x.npy, which contains the inputs ($x^{(i)} \in \mathbb{R}$), and q3y.npy, which contains the outputs (targets) ($y^{(i)}$) for a linear regression problem, with one training example per row.³

i. [8 points] (Autograder) Implement the closed-form solution for locally weighted linear regression (see the accompanying code).

ii. [2 points] Use the implemented locally weighted linear regression solver on this dataset (using the weighted normal equations you derived in part (b)), and plot the data and the curve resulting from your fit. When evaluating local regression at a query point $x$ (which is real-valued in this problem), use weights

$$r^{(i)} = \exp\left( -\frac{(x - x^{(i)})^2}{2 \tau^2} \right)$$

with a bandwidth parameter $\tau \in \{0.1, 0.3, 0.8, 2, 10\}$. Please attach the plots generated in 3.(d).ii of your notebook to the write-up.
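For intuition, a sketch of a locally weighted prediction at a single query point, assuming the row-major layout noted in footnote 3. Rather than writing out the weighted normal equation from part (b), it rescales each example by the square root of its weight and calls an ordinary least-squares solver, an equivalent route; names (`lwr_predict`, `x0`, `tau`) are illustrative, not the starter code's:

```python
import numpy as np

# Locally weighted prediction at one query point x0 with Gaussian weights.
def lwr_predict(x, y, x0, tau):
    r = np.exp(-((x0 - x) ** 2) / (2.0 * tau ** 2))   # weights r^(i)
    Phi = np.stack([np.ones_like(x), x], axis=1)      # row-major design (1, x)
    sw = np.sqrt(r)
    # Weighted least squares via sqrt-weight rescaling of rows.
    w, *_ = np.linalg.lstsq(sw[:, None] * Phi, sw * y, rcond=None)
    return w[0] + w[1] * x0

x = np.linspace(-3, 3, 50)
y = np.sin(x)
pred = lwr_predict(x, y, 0.0, tau=0.3)   # local linear fit near x0 = 0
```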

iii. [2 points] In the write-up, discuss and comment briefly on what happens to the fit when τ is

too small or too large.

³Please note that we are using the row-major version for implementation, as described in the starter package.
