# CSCI567 Homework 3 solution


## 1 Logistic Regression and Regularization

Let us consider a binary logistic regression model.

(a) Given $n$ training examples $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, write down the negative log-likelihood (used as the loss function):

$$L(w) = -\log\left(\prod_{i=1}^{n} P(Y = y_i \mid X = x_i)\right).$$

(b) Is this loss function convex? Provide your reasoning.

(c) Show that the magnitude of the optimal $w$ can go to infinity when the training samples are linearly separable (i.e. the data can be perfectly classified by a linear classifier).
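To see (c) concretely, the following NumPy sketch (illustrative only; the programming part of this assignment must be done in MATLAB) evaluates the unregularized logistic loss on a linearly separable 1-D dataset as the weight is scaled up. Every point is already classified correctly, so scaling $w$ only shrinks the loss, and no finite $w$ is optimal.

```python
import numpy as np

# Linearly separable 1-D data with labels in {-1, +1}.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1, -1, 1, 1])

def neg_log_likelihood(w):
    """Logistic loss L(w) = sum_i log(1 + exp(-y_i * w * x_i)),
    written with logaddexp for numerical stability."""
    return np.sum(np.logaddexp(0.0, -y * w * x))

losses = [neg_log_likelihood(w) for w in (1.0, 10.0, 100.0)]
# losses is strictly decreasing and tends to 0 as w -> infinity.
```

The same behavior holds in any dimension: on separable data, replacing $w$ by $cw$ with $c > 1$ strictly decreases the loss.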
(d) A convenient way to resolve the problem described in (c) is to add a regularization term to the negative log-likelihood as follows:

$$L(w) = -\log\left(\prod_{i=1}^{n} P(Y = y_i \mid X = x_i)\right) + \lambda \|w\|_2^2, \quad (1)$$

where $\|w\|_2^2 = \sum_i w_i^2$. Derive the gradient of this regularized loss, i.e. compute $\frac{\partial L(w)}{\partial w_i}$.

(e) Show that the problem in Eq. (1) has a unique solution.
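One possible route for (e) (a sketch, not necessarily the intended argument): writing $\sigma(z) = 1/(1+e^{-z})$, the Hessian of the regularized loss in Eq. (1) is

$$\nabla^2 L(w) = \sum_{i=1}^{n} \sigma(w^\top x_i)\left(1 - \sigma(w^\top x_i)\right) x_i x_i^\top + 2\lambda I \;\succeq\; 2\lambda I \;\succ\; 0,$$

so for $\lambda > 0$ the objective is strictly convex; it is also coercive because the $\lambda\|w\|_2^2$ term grows without bound, and a strictly convex, coercive function has exactly one minimizer.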

## 2 Sparsity via Regularization

Given a set of training data $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$ where $x_i \in \mathbb{R}^D$, linear regression learns the weight vector $w$ (assuming the bias term is absorbed into $w$) by solving

$$\min_w \sum_{i=1}^{N} (y_i - w^\top x_i)^2. \quad (2)$$

For many real-world applications (e.g. text and image classification, gene sequence analysis), the dimensionality $D$ can be very high. In practice, however, only a small portion of the features may actually be useful in predicting the target variable $y$. This motivates us to enforce a sparse structure on $w$: many elements of $w$ should be zero.

(a) One way to incorporate sparsity into linear regression is to add a regularization term as follows:

$$\min_w \sum_{i=1}^{N} (y_i - w^\top x_i)^2 + \lambda \|w\|_0 \quad (3)$$

where $\lambda > 0$, and $\|w\|_0 = \#\{i : w_i \neq 0\}$ is called the 0-norm of $w$, which equals the number of nonzero entries of $w$. The challenge is that $\|w\|_0$ is not a convex function of $w$. Provide an example to show the nonconvexity of $\|w\|_0$.
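As a sanity check (one possible counterexample, not necessarily the intended one): convexity would require $\|\tfrac{1}{2}(u+v)\|_0 \leq \tfrac{1}{2}\|u\|_0 + \tfrac{1}{2}\|v\|_0$ for all $u, v$. Take $u = (2, 0)$ and $v = (0, 2)$ in $\mathbb{R}^2$, so $\|u\|_0 = \|v\|_0 = 1$; the midpoint is $(1, 1)$, and

$$\left\|\tfrac{1}{2}(u+v)\right\|_0 = \|(1,1)\|_0 = 2 > 1 = \tfrac{1}{2}\|u\|_0 + \tfrac{1}{2}\|v\|_0,$$

which violates the convexity inequality.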

(b) One way to overcome the nonconvexity of the 0-norm is to use the 1-norm: $\|w\|_1 = \sum_i |w_i|$, the sum of the absolute values of the elements. Although the 1-norm is only an approximation of the 0-norm, it does often lead to sparse solutions in practice. Prove that $\|w\|_1$ is convex in $w$.
(c) Due to the nondifferentiability of $\|w\|_1$, solving

$$\min_w \sum_{i=1}^{N} (y_i - w^\top x_i)^2 + \lambda \|w\|_1 \quad (4)$$

becomes challenging; it no longer has an analytic solution. One method to optimize the objective function is to transform it into a quadratic program (QP), for which fast solvers exist. A QP has the following standard form:

$$\min_u \frac{1}{2} u^\top Q u + c^\top u \quad (5)$$

$$\text{subject to } A u \leq b \quad (6)$$

where $u \in \mathbb{R}^P$ is the vector to be optimized ($P$ can be different from $D$), and $c \in \mathbb{R}^P$, $b \in \mathbb{R}^G$, $A \in \mathbb{R}^{G \times P}$, and $Q \in \mathbb{R}^{P \times P}$ are coefficients. Show that (4) can be converted into a standard QP.

(Hint: introduce $D$ additional variables $t_1, \dots, t_D$ with $t_d \geq w_d$ and $t_d \geq -w_d$ for $d = 1, \dots, D$. Then show that minimizing $\|w\|_1$ is equivalent to minimizing $\sum_{d=1}^{D} t_d$, which is an upper bound of $\|w\|_1$. To further transform into a standard QP, stack $w$ and $t$ into a vector $u \in \mathbb{R}^{2D}$, and then form the corresponding $Q \in \mathbb{R}^{2D \times 2D}$, $A \in \mathbb{R}^{2D \times 2D}$, $c \in \mathbb{R}^{2D}$ and $b \in \mathbb{R}^{2D}$.)
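The construction in the hint can be checked numerically. The sketch below (illustrative NumPy/SciPy only; the assignment itself is in MATLAB, and the variable names are my own) builds $Q$, $c$, $A$, $b$ for $u = [w; t]$ and solves the resulting QP with a generic constrained solver. Because the objective pushes each $t_d$ down while the constraints force $t_d \geq |w_d|$, at the optimum $t_d = |w_d|$, so the QP objective matches (4).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, D, lam = 50, 3, 1.0
X = rng.normal(size=(N, D))
w_true = np.array([2.0, 0.0, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=N)

# u = [w; t]; objective (1/2) u^T Q u + c^T u (the constant ||y||^2 is dropped).
Q = np.zeros((2 * D, 2 * D))
Q[:D, :D] = 2 * X.T @ X
c = np.concatenate([-2 * X.T @ y, lam * np.ones(D)])

# Constraints A u <= b encode w_d - t_d <= 0 and -w_d - t_d <= 0.
I = np.eye(D)
A = np.block([[I, -I], [-I, -I]])
b = np.zeros(2 * D)

obj = lambda u: 0.5 * u @ Q @ u + c @ u
cons = {"type": "ineq", "fun": lambda u: b - A @ u}  # scipy wants fun >= 0
res = minimize(obj, np.zeros(2 * D), constraints=cons)
w, t = res.x[:D], res.x[D:]
# At the optimum each t_d tightens to |w_d|, recovering the 1-norm penalty.
```

In the homework itself one would hand these matrices to a dedicated QP solver (e.g. MATLAB's `quadprog`) rather than a general-purpose minimizer.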

## 3 Kernel Ridge Regression

In this problem, we will provide guidelines for you to derive kernel ridge regression, a nonlinear extension of linear ridge regression. You will be asked to implement it in the programming part.

Given a set of training data $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$ where $x_i \in \mathbb{R}^D$, linear ridge regression learns the weight vector $w$ (assuming the bias term is absorbed into $w$) by optimizing the following objective function:

$$\min_w \sum_{i=1}^{N} (y_i - w^\top x_i)^2 + \lambda \|w\|_2^2 \quad (7)$$

where $\lambda \geq 0$ is the regularization coefficient.

(a) Let $X \in \mathbb{R}^{N \times D}$ be the matrix whose $n$th row is $x_n^\top$, and let $y \in \mathbb{R}^N$ be the vector whose $n$th element is $y_n$. Show that the optimal $w$ can be written as

$$w^* = (X^\top X + \lambda I_D)^{-1} X^\top y \quad (8)$$

where $I_D$ denotes the $D \times D$ identity matrix.
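Eq. (8) is easy to sanity-check numerically (a NumPy sketch, illustrative only; the assignment implementation is in MATLAB): at the claimed optimum, the gradient of objective (7), $2X^\top(Xw - y) + 2\lambda w$, should vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 40, 5, 0.5
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Closed-form ridge solution, Eq. (8): w* = (X^T X + lambda I)^{-1} X^T y.
w_star = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Gradient of objective (7) at w*; numerically zero at the optimum.
grad = 2 * X.T @ (X @ w_star - y) + 2 * lam * w_star
```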

(b) Now we apply a nonlinear feature mapping to each sample, $x_i \mapsto \Phi_i = \phi(x_i) \in \mathbb{R}^T$, where the dimensionality $T$ is much larger than $D$. Define $\Phi \in \mathbb{R}^{N \times T}$ as the matrix whose $i$th row is $\Phi_i^\top$. Show that $w^*$ can be written as

$$w^* = \Phi^\top (\Phi \Phi^\top + \lambda I_N)^{-1} y \quad (9)$$

Hint: you may use the following identity for matrices. For any $P \in \mathbb{R}^{p \times p}$, $B \in \mathbb{R}^{q \times p}$, $R \in \mathbb{R}^{q \times q}$, assuming the matrix inversions are valid, we have

$$(P^{-1} + B^\top R^{-1} B)^{-1} B^\top R^{-1} = P B^\top (B P B^\top + R)^{-1}.$$

(c) Given a test sample $\phi(x)$, show that the prediction $\hat{y} = w^{*\top} \phi(x)$ can be written as

$$\hat{y} = y^\top (K + \lambda I_N)^{-1} \kappa(x) \quad (10)$$

where $K \in \mathbb{R}^{N \times N}$ is the kernel matrix defined by $K_{ij} = \Phi_i^\top \Phi_j$, and $\kappa(x) \in \mathbb{R}^N$ is the vector with $n$th element $\kappa(x)_n = \phi(x_n)^\top \phi(x)$. Now you can see that $\hat{y}$ depends only on the dot products (kernel values) of $\{\Phi_i\}$.

(d) Compare the computational complexity of linear ridge regression and kernel ridge regression.

## 4 Kernel Construction

The Mercer theorem can be used to construct valid kernel functions. The theorem states that a bivariate function $k(\cdot, \cdot)$ is a positive definite kernel function if and only if, for any $N$ and any $x_1, x_2, \dots, x_N$, the matrix $K$, where $K_{ij} = k(x_i, x_j)$, is positive semidefinite. That is, all the eigenvalues of the matrix are non-negative. An alternative (but equivalent) definition states that for every positive semidefinite matrix $A \in \mathbb{R}^{n \times n}$ and arbitrary vector $x \in \mathbb{R}^{n}$, we have $x^\top A x \geq 0$.

Suppose $k_1(\cdot, \cdot)$ and $k_2(\cdot, \cdot)$ are kernel functions, and $x$ and $x'$ are any two samples. Prove that $k_3$, $k_4$, $k_5$, $k_6$ and $k_7$ are valid kernel functions using the Mercer theorem:

(a) $k_3(x, x') = a_1 k_1(x, x') + a_2 k_2(x, x')$ where $a_1, a_2 \geq 0$.

(b) $k_4(x, x') = f(x) f(x')$ where $f(\cdot)$ is a real-valued function.

(c) $k_5(x, x') = g(k_1(x, x'))$ where $g(\cdot)$ is a polynomial function with positive coefficients.

(d) $k_6(x, x') = k_1(x, x') \, k_2(x, x')$.

(e) $k_7(x, x') = \exp(k_1(x, x'))$.
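The claims can be sanity-checked numerically before proving them (a NumPy sketch, illustrative only; the particular choices of $k_1$, $k_2$, $f$, and $g$ are my own examples): build Gram matrices for $k_3, \dots, k_7$ from two base kernels on random points and confirm their eigenvalues are non-negative up to floating-point tolerance.

```python
import numpy as np

rng = np.random.default_rng(0)
Xs = rng.normal(size=(8, 3))                       # 8 random samples in R^3

k1 = lambda x, z: x @ z                            # linear kernel as k1
k2 = lambda x, z: np.exp(-np.sum((x - z) ** 2))    # RBF kernel as k2
f = lambda x: np.sin(x).sum()                      # an arbitrary real-valued f
g = lambda s: 2 * s**2 + 3 * s + 1                 # polynomial, positive coefficients

constructions = [
    lambda x, z: 0.5 * k1(x, z) + 2.0 * k2(x, z),  # k3 (a1 = 0.5, a2 = 2)
    lambda x, z: f(x) * f(z),                      # k4
    lambda x, z: g(k1(x, z)),                      # k5
    lambda x, z: k1(x, z) * k2(x, z),              # k6
    lambda x, z: np.exp(k1(x, z)),                 # k7
]

# Smallest eigenvalue of each Gram matrix; all should be >= 0 up to rounding.
mins = [np.linalg.eigvalsh(np.array([[k(a, b) for b in Xs] for a in Xs])).min()
        for k in constructions]
```

A numerical check is of course no substitute for the Mercer-theorem proof the problem asks for, but it catches sign errors quickly.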

## 5 Programming – Bias/Variance Tradeoff

Let the function $f(x) = 2x^2 + \epsilon$, where $\epsilon$ is Gaussian noise drawn from $\mathcal{N}(0, 0.1)$. We are also given the following functions:

• $g_1(x) = 1$
• $g_2(x) = w_0$
• $g_3(x) = w_0 + w_1 x$
• $g_4(x) = w_0 + w_1 x + w_2 x^2$
• $g_5(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$
• $g_6(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$

(a) Randomly generate 100 datasets. Each dataset contains 10 samples $(x_i, y_i)$, where $x_i$ is uniformly sampled from $[-1, 1]$ and $y_i = f(x_i)$. For a given $g$ function, estimate its parameters using linear regression and compute the sum of squared errors on every dataset. Then plot the histogram of the mean squared error. Repeat this for each $g$ function. Please provide the 6 histogram plots in the report. For each $g$ function, estimate its bias² and variance, and report the results.
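One possible shape of the estimator (illustrative NumPy only; the assignment requires MATLAB, and the test grid, the reading of $\mathcal{N}(0, 0.1)$ as variance $0.1$, and the function names are my assumptions): average the fitted predictions over the 100 datasets, compare that average to the noise-free target $2x^2$ for bias², and use the spread of the predictions for variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def bias2_variance(degree, n_datasets=100, n_samples=10):
    """Estimate bias^2 and variance of a degree-`degree` polynomial fit
    to y = 2x^2 + eps (eps read as N(0, var=0.1)) on a fixed test grid."""
    x_test = np.linspace(-1, 1, 50)
    preds = np.empty((n_datasets, x_test.size))
    for k in range(n_datasets):
        x = rng.uniform(-1, 1, n_samples)
        y = 2 * x**2 + rng.normal(0.0, np.sqrt(0.1), n_samples)
        coef = np.polyfit(x, y, degree)
        preds[k] = np.polyval(coef, x_test)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - 2 * x_test**2) ** 2)
    variance = preds.var(axis=0).mean()
    return bias2, variance

# e.g. compare the constant model g2 (degree 0) with g4 (degree 2):
b_const, v_const = bias2_variance(0)
b_quad, v_quad = bias2_variance(2)
# b_const > b_quad: the quadratic can represent the true function, the constant cannot.
```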

(b) In (a), change the sample size in each dataset to 100, and repeat the procedures. Plot the resulting histograms in the report. For each $g$ function, estimate its bias² and variance, and report the results.

(c) Given your results in (a) and (b), discuss how the model complexity and the sample size affect the bias² and variance.

(d) Consider the function $h(x) = w_0 + w_1 x + w_2 x^2 + \lambda (w_0^2 + w_1^2 + w_2^2)$, where $\lambda \geq 0$ is a regularization parameter. Following (b) (i.e. 100 samples per dataset), estimate the bias² and variance of $h(x)$ with $\lambda$ set to 0.01, 0.1, 1, and 10 respectively. Discuss how $\lambda$ affects the bias² and variance.
## 6 Programming – Regression
In this problem, you will implement linear ridge regression and kernel ridge regression, and work
on a regression dataset.

All programming should be done in MATLAB. You must implement certain functions which we
specify. Otherwise, you may use built-in MATLAB functions. Do not use code from the internet
or from other students. Record answers to questions in your written LaTeX report.
Dataset The dataset contains 3,107 observations on U.S. county votes cast in the 1980 presidential election. Each observation contains 6 features related to the population structure, economic conditions, and geographical information of the county. The goal is to predict the proportion of votes cast. Details of the dataset can be found at http://lib.stat.cmu.edu/datasets/space_ga.

Preprocessing Randomly split the dataset into a training set (80% of samples) and a test set (20% of samples). Normalize each feature to have zero mean and unit variance (note that you should only use the mean and variance computed from the training data). When experimenting with any model described below, do 10 random splits, and report the average test error.
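The normalization step is a common source of test-set leakage, so a sketch may help (illustrative NumPy; the assignment is in MATLAB and the function name is my own): compute the mean and standard deviation on the training rows only, then apply those same statistics to the test rows.

```python
import numpy as np

def standardize(X_train, X_test):
    """Zero-mean, unit-variance scaling using training statistics only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0              # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 6))
idx = rng.permutation(100)
train, test = X[idx[:80]], X[idx[80:]]   # 80% / 20% random split
train_s, test_s = standardize(train, test)
# train_s has zero mean and unit variance per feature; test_s generally does
# not, because it was scaled with the training statistics.
```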

Linear Ridge Regression Implement the linear ridge regression described in Problem 3 using MATLAB. For the regularization parameter $\lambda$, use 5-fold cross-validation on the training set to pick the optimal value from $[0, 10^{-4}, 10^{-3}, 10^{-2}, \dots, 10^{2}, 10^{3}]$. Note that for each random split of the data, you need to run cross-validation to select $\lambda$. Report the optimal $\lambda$ for each random split. Additionally, report the average test error.
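The selection loop might look like this (an illustrative NumPy sketch; the assignment is in MATLAB, and `ridge_fit`, `pick_lambda`, and the fold layout are my own choices): for each candidate $\lambda$, average the validation error over 5 folds of the training set, and keep the $\lambda$ with the smallest average.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution, Eq. (8)."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

def pick_lambda(X, y, grid, n_folds=5):
    """5-fold cross-validation: return the lambda with the lowest mean
    validation MSE over the folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    cv_err = []
    for lam in grid:
        errs = []
        for fold in folds:
            mask = np.ones(len(y), dtype=bool)
            mask[fold] = False           # train on everything but this fold
            w = ridge_fit(X[mask], y[mask], lam)
            errs.append(np.mean((y[fold] - X[fold] @ w) ** 2))
        cv_err.append(np.mean(errs))
    return grid[int(np.argmin(cv_err))]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=100)
grid = [0.0] + [10.0**p for p in range(-4, 4)]    # [0, 1e-4, ..., 1e3]
best_lam = pick_lambda(X, y, grid)
```

Shuffling the rows before splitting into folds (already done here via the random data) matters on real datasets whose rows are ordered.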
Kernel Ridge Regression Implement the kernel ridge regression described in Problem 3 using MATLAB. You need to try three types of kernels:

(a) Linear kernel: $k(x_i, x_j) = x_i^\top x_j$.

(b) Polynomial kernel: $k(x_i, x_j) = (x_i^\top x_j + a)^b$, where $a$ and $b$ are additional parameters. You need to choose $a$ from $[-1, -0.5, 0, 0.5, 1]$, and choose $b$ from $[1, 2, 3, 4]$.
(c) Gaussian RBF kernel: $k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|_2^2}{\sigma^2}\right)$, where $\sigma$ is an additional parameter. You need to choose $\sigma^2$ from $[0.125, 0.25, 0.5, 1, 2, 4, 8]$.
For each kernel type, use 5-fold cross-validation on the training set to pick the optimal $\lambda$ from $[0, 10^{-4}, 10^{-3}, 10^{-2}, \dots, 10^{2}, 10^{3}]$ together with the other kernel parameters. Report the optimal parameters for each random split. Additionally, report the average test error.
Comparison Does kernel ridge regression with a linear kernel give the same results as linear ridge regression? If not, why? Among the three types of kernels, which one performs the best?

### Submission Instructions

You need to provide the following:

• Provide your answers to problems 1–6 in a PDF file named CSCI567 hw3 fall15.pdf. You need to submit the homework both in hard copy (at Locker #19 at PHE, in the box labeled CSCI567-homework, by 6pm on the deadline date) and electronically as a PDF file on Blackboard. If you choose handwriting instead of typing all the answers, you will get 40% of the points deducted.

• Submit ALL the code and the report via Blackboard by 6pm on the deadline date. The only acceptable language is MATLAB. Your program MUST include a main function called CSCI567 hw3 fall15.m in the root of your folder. After running this main file, your program should be able to generate all of the results needed for this programming assignment, either as plots or console outputs. You can have multiple files (i.e. your subfunctions); the only requirement is that once we unzip your folder and execute