EECS545 Homework 3

1 [21 + 3 points] Direct Construction of Valid Kernels

In class, we saw that by choosing a kernel $k(x, z) = \phi(x)^\top \phi(z)$, we can implicitly map data to a high-dimensional space and have the SVM algorithm work in that space. One way to generate kernels is to explicitly define the mapping $\phi$ to a higher-dimensional space and then work out the corresponding $k$. However, in this question, we are interested in the direct construction of kernels.

Suppose we have a function $k(x, z)$ that gives an appropriate similarity measure (similar to an inner product) for our learning problem, and consider plugging $k$ into a kernelized algorithm (like SVM) as the kernel function. In order for $k(x, z)$ to be a valid kernel, it must correspond to an inner product in some higher-dimensional space resulting from some feature mapping $\phi$. In addition, Mercer's theorem states that $k(x, z)$ is a (Mercer) kernel if and only if for any finite set $x^{(1)}, \ldots, x^{(N)}$, the matrix $K \in \mathbb{R}^{N \times N}$ given by $K_{ij} = k(x^{(i)}, x^{(j)})$ is symmetric and positive semi-definite.
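Mercer's condition also suggests a quick numerical sanity check (evidence, not a proof): sample a few points, build the Gram matrix, and inspect its eigenvalues. Below is a minimal NumPy sketch, using a Gaussian kernel purely as an illustration.

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # Gaussian (RBF) kernel, used here only as an illustrative example.
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))  # N = 10 points in R^2

# Gram matrix K_ij = k(x^(i), x^(j))
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

# Mercer's condition on this finite sample: symmetric and PSD.
assert np.allclose(K, K.T)
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())  # should be >= -1e-10 (numerical noise)
```

A single genuinely negative eigenvalue on some sample would disprove validity; nonnegative eigenvalues on sampled sets are only supporting evidence.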
Now here comes the question: let $k_1, k_2$ be kernels over $\mathbb{R}^D \times \mathbb{R}^D$, let $a \in \mathbb{R}^+$ be a positive real number, let $f : \mathbb{R}^D \to \mathbb{R}$ be a real-valued function, let $\phi : \mathbb{R}^D \to \mathbb{R}^M$ be a function mapping from $\mathbb{R}^D$ to $\mathbb{R}^M$, let $k_3$ be a kernel over $\mathbb{R}^M \times \mathbb{R}^M$, and let $p : \mathbb{R} \to \mathbb{R}$ be a polynomial with positive coefficients.
For each of the functions $k$ below, state whether it is necessarily a kernel. If you think it is a kernel, please prove it; if you think it is not, please prove it or give a counterexample. You may use either definition (inner product of a feature map, or a PSD Gram matrix) when proving the validity of a kernel.

If you have already proved a property in one sub-question (e.g., in Q1.(a)), it is okay to use it in other sub-questions, e.g., Q1.(f), by noting that you proved that property in Q1.(a); please make sure to provide the proof at least once. However, you are NOT allowed to use the 'results' from the 'Constructing kernels' slide without proving them yourself.

(a) [2 points] $k(x, z) = k_1(x, z) + k_2(x, z)$
(b) [2 points] $k(x, z) = k_1(x, z) - k_2(x, z)$
(c) [2 points] $k(x, z) = -a\,k_1(x, z)$
(d) [2 points] $k(x, z) = f(x)\,f(z)$
(e) [3 points] $k(x, z) = k_3(\phi(x), \phi(z))$
(f) [3 points] $k(x, z) = p(k_1(x, z))$
(g) [3 points] $k(x, z) = k_1(x, z)\,k_2(x, z)$
(h) [4 points] For this sub-problem, you are not required to check whether $k$ is a valid kernel. Instead, given the kernel $k(x, z) = (x^\top z + 1)^2$, find a feature map $\phi$ associated with $k(x, z)$ such that $k(x, z) = \phi(x)^\top \phi(z)$. You may assume $D = 2$ for this sub-problem.
(i) [3 points, extra credit] Prove that the Gaussian kernel, $k(x, z) = \exp\!\left(-\|x - z\|^2 / 2\sigma^2\right)$, can be expressed as $k(x, z) = \phi(x)^\top \phi(z)$, where $\phi(\cdot)$ is an infinite-dimensional vector. Specifically, find the explicit closed form of an infinite-dimensional feature vector $\phi$. (Note: you cannot directly apply Mercer's theorem here, as $\phi$ is infinite-dimensional.)
[Hint 1: $\|x - z\|^2 = x^\top x + z^\top z - 2x^\top z$. It might be useful to consider the power series $\exp(x) = \sum_{n=0}^{\infty} \frac{1}{n!} x^n$.]
[Hint 2: You might find the formula on slide 16 of Lecture 8 helpful.]
[Hint 3: Each element of $\phi(x)$ can be written as an explicit formula over $x$ without any limits or infinite summations.]

2 [14 points] Implementing Soft Margin SVM by Optimizing Primal Objective

Support Vector Machines (SVMs) are a discriminative model for classification. Although it is possible to develop SVMs that perform K-class classification, we will restrict ourselves to binary classification in this question, where the class label is either +1 (positive) or −1 (negative). The SVM is not a probabilistic algorithm; in other words, in its usual construction, it does not optimize a probability measure as a likelihood.

SVM tries to find the "best" hyperplane that maximally separates the positive class from the negative class. Recall that the objective function for maximizing the soft margin is equivalent to the following minimization problem:
\[
\begin{aligned}
\min_{w, b, \xi} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi^{(i)} \\
\text{subject to} \quad & y^{(i)}\!\left(w^\top \phi(x^{(i)}) + b\right) \ge 1 - \xi^{(i)}, \quad \forall i = 1, \ldots, N \\
& \xi^{(i)} \ge 0, \quad \forall i = 1, \ldots, N
\end{aligned}
\tag{1}
\]
The above is known as the primal objective of the SVM. Notice the two constraints on the slack variables $\xi^{(i)}$: each $\xi^{(i)}$ must satisfy both conditions, while the sum of the $\xi^{(i)}$'s, weighted by $C$, is minimized.

The constrained minimization is equivalent to the following minimization involving the hinge loss term:
\[
\min_{w, b} E(w, b), \qquad E(w, b) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \max\!\left(0,\; 1 - y^{(i)}\!\left(w^\top \phi(x^{(i)}) + b\right)\right)
\tag{2}
\]

You will be working with minimization of the above objective in this problem.
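For concreteness, here is a minimal NumPy sketch of evaluating the objective in Equation (2) on a batch, assuming the identity feature map $\phi(x) = x$; the array names (Phi, y, w, b) are illustrative, not from the assignment's starter code.

```python
import numpy as np

def primal_objective(w, b, Phi, y, C):
    """E(w, b) from Eq. (2); Phi is the N x D feature matrix, y holds +/-1 labels."""
    margins = y * (Phi @ w + b)           # y^(i) (w^T phi(x^(i)) + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```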
(a) [5 points] Find the derivatives of the loss function $E(w, b)$ with respect to the parameters $w$ and $b$. Show that:
\[
\nabla_w E(w, b) = w - C \sum_{i=1}^{N} \mathbb{I}\!\left[ y^{(i)}\!\left(w^\top \phi(x^{(i)}) + b\right) < 1 \right] y^{(i)} \phi(x^{(i)})
\tag{3}
\]
\[
\partial_b E(w, b) = -C \sum_{i=1}^{N} \mathbb{I}\!\left[ y^{(i)}\!\left(w^\top \phi(x^{(i)}) + b\right) < 1 \right] y^{(i)},
\tag{4}
\]
where $\mathbb{I}[\cdot]$ is the indicator function. If $f(x) = \max(0, x)$, then assume that
\[
\frac{\partial f}{\partial x} = \begin{cases} 1 & x > 0 \\ 0 & \text{otherwise.} \end{cases}
\]
(b) [7 points] (Autograder) Implement the SVM algorithm using batch gradient descent, following part 1 of soft_margin_svm.ipynb. In previous assignments, you implemented gradient descent while minimizing over one variable. Minimizing over two variables $(w, b)$ is no different: both gradients are computed from the current parameter values, and then the parameters are updated simultaneously. Example pseudocode for optimizing $w$ and $b$ is given by Algorithm 1 below.

(Note: it is a common mistake to implement gradient descent over multiple parameters by updating the first parameter, then computing the derivative with respect to the second parameter using the updated value of the first. In fact, updating one parameter and then computing the gradient of the second using the updated value of the first is a different optimization algorithm, known as coordinate descent.)
Algorithm 1: SVM Batch Gradient Descent

    w* ← 0
    b* ← 0
    for j = 1 to NumEpochs do
        wgrad ← ∇_w E(w*, b*)
        bgrad ← ∂E(w*, b*)/∂b
        w* ← w* − η(j) · wgrad
        b* ← b* − η(j) · bgrad
    end
    return w*, b*
Throughout this question, we will set the $C$ from the objective above to $C = 5$, NumEpochs $= 100$, and the learning rate $\eta$ to a constant, i.e., $\eta(j) = 10^{-3}$.
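A direct Python rendering of Algorithm 1 with these hyperparameters, reusing the hypothetical primal_gradients sketch above:

```python
def train_svm_batch_gd(Phi, y, C=5.0, num_epochs=100, eta=1e-3):
    N, D = Phi.shape
    w, b = np.zeros(D), 0.0
    for _ in range(num_epochs):
        # Compute BOTH gradients at the current (w, b) before updating either
        # (updating one first would be coordinate descent, as noted above).
        grad_w, grad_b = primal_gradients(w, b, Phi, y, C)
        w = w - eta * grad_w
        b = b - eta * grad_b
    return w, b
```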
(Remarks: In this problem, we only ask you to implement batch gradient descent for the primal SVM, but it is quite straightforward to implement stochastic gradient descent in a very similar way, although you may need to decay the learning rate and set the initial learning rate properly. Unlike stochastic gradient descent, batch gradient descent does not need a decaying learning rate for convergence.)
(c) [2 points] Run gradient descent over the training data 5 times, once for each of NumEpochs = [1, 3, 10, 30, 100]. For each run, please report your trained parameters $(w, b)$ and the test classification accuracy in your writeup.

3 [20 points] Asymmetric Cost SVM

Consider applying an SVM to a supervised learning problem where the cost of a false positive (mistakenly
predicting +1 when the label is -1) is different from the cost of a false negative (predicting -1 when the
label is +1). The asymmetric cost SVM models these asymmetric costs by posing the following optimization
problem:
\[
\begin{aligned}
\min_{w, b, \xi} \quad & \frac{1}{2} w^\top w + C_0 \sum_{i : y^{(i)} = -1} \xi^{(i)} + C_1 \sum_{i : y^{(i)} = 1} \xi^{(i)} \\
\text{s.t.} \quad & y^{(i)}\!\left(w^\top \phi(x^{(i)}) + b\right) \ge 1 - \xi^{(i)}, \quad \forall i = 1, \ldots, N \\
& \xi^{(i)} \ge 0, \quad \forall i = 1, \ldots, N
\end{aligned}
\]
Here, $C_0$ is the cost of a false positive and $C_1$ is the cost of a false negative. (Both $C_0$ and $C_1$ are fixed, known constants.)

(a) [4 points] We will find the dual optimization problem. First, write down the Lagrangian. Use $\alpha^{(i)}$ and $\mu^{(i)}$ to denote the Lagrange multipliers corresponding to the two constraints ($\alpha^{(i)}$ for the first constraint, and $\mu^{(i)}$ for the second constraint, on the slack variables) in the primal optimization problem above.
\[
\mathcal{L}(w, b, \xi, \alpha, \mu) =
\]

(b) [6 points] Calculate the following derivatives with respect to the primal variables:
\[
\nabla_w \mathcal{L}(w, b, \xi, \alpha, \mu) =
\]
\[
\frac{\partial \mathcal{L}(w, b, \xi, \alpha, \mu)}{\partial b} =
\]
\[
\nabla_{\xi^{(i)}} \mathcal{L}(w, b, \xi, \alpha, \mu) =
\]
[Hint] You can define $C^{(i)}$ as $C^{(i)} = C_0\, \mathbb{I}[y^{(i)} = -1] + C_1\, \mathbb{I}[y^{(i)} = 1]$.
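In code, the hint's per-sample cost is a one-liner; a hypothetical NumPy sketch (array names assumed, not from the assignment):

```python
import numpy as np

# C^(i) = C0 * I[y^(i) = -1] + C1 * I[y^(i) = 1], vectorized over samples.
C_i = np.where(y == -1, C0, C1)
```

(For reference, scikit-learn's SVC exposes the same idea through its class_weight argument, which rescales $C$ per class.)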
(c) [10 points] Find the dual optimization problem. You should write down the dual optimization problem in the following form. Try to simplify your answer as much as possible. In particular, to obtain full credit, the Lagrange multipliers $\mu^{(i)}$ should not appear in your simplified form of the dual.
\[
\max_\alpha \; W(\alpha) = \qquad \text{s.t.} \quad
\]

4 [18 points] SVMs with Convex Optimization

In this problem, you will practice training SVMs on toy data. You will learn how to use the popular Scikit-Learn library's SVM implementation and how to use the convex optimization library CVXOPT to train an SVM by directly solving the dual optimization problem.
(a) [4 points] Recall from lecture that the (kernelized) dual optimization problem of SVMs can be written as follows:
\[
\begin{aligned}
\underset{\alpha}{\text{maximize}} \quad & \sum_{n=1}^{N} \alpha^{(n)} - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha^{(n)} \alpha^{(m)} y^{(n)} y^{(m)} k(x^{(n)}, x^{(m)}) \\
\text{subject to} \quad & 0 \le \alpha^{(n)} \le C, \quad n = 1, \ldots, N \\
& \sum_{n=1}^{N} \alpha^{(n)} y^{(n)} = 0
\end{aligned}
\tag{5}
\]
where there are $N$ training examples, $x^{(i)} \in \mathbb{R}^D$, $y^{(i)} \in \{-1, 1\}$, $k$ is a kernel function $k : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$, $\alpha \in \mathbb{R}^N$, and $C \in \mathbb{R}$.

To solve this problem using CVXOPT, we will use the quadratic programming solver cvxopt.solvers.qp. Given matrices $P, G, A$ and vectors $q, h, b$, CVXOPT can solve the following optimization problem (a quadratic program):
\[
\begin{aligned}
\underset{v}{\text{minimize}} \quad & \frac{1}{2} v^\top P v + q^\top v \\
\text{subject to} \quad & Gv \preceq h \\
& Av = b
\end{aligned}
\tag{6}
\]

Find values for the matrices $P, G, A$ and vectors $q, h, b$ in terms of $x^{(i)}$, $y^{(i)}$, $k(x^{(i)}, x^{(j)})$, and $C$ such that the solution of the SVM dual problem ($\alpha$ in Equation 5) equals the solution of the CVXOPT problem ($v$ in Equation 6).

Hint 1. A maximization problem can be turned into a minimization problem by a sign change: maximizing $f(x)$ over $x$ is equivalent to minimizing $-f(x)$.
Hint 2. $0 \le \alpha^{(n)} \le C$ can be separated into two constraints: $-\alpha^{(n)} \le 0$ and $\alpha^{(n)} \le C$.
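To make the target interface concrete, here is a minimal sketch of calling cvxopt.solvers.qp on a generic toy QP (a 2-variable problem, not the SVM dual; deriving the SVM's $P, q, G, h, A, b$ is exactly part (a)):

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy QP: minimize 1/2 v^T P v + q^T v  s.t.  Gv <= h,  Av = b.
P = matrix(np.array([[2.0, 0.0], [0.0, 2.0]]))    # CVXOPT wants its own
q = matrix(np.array([-1.0, -1.0]))                # dense 'matrix' type
G = matrix(np.array([[-1.0, 0.0], [0.0, -1.0]]))  # encodes v >= 0
h = matrix(np.zeros(2))
A = matrix(np.array([[1.0, 1.0]]))                # encodes v1 + v2 = 1
b = matrix(np.array([1.0]))

solvers.options["show_progress"] = False
sol = solvers.qp(P, q, G, h, A, b)
v = np.array(sol["x"]).flatten()  # for the SVM dual, this would be alpha
print(v)                          # here: [0.5 0.5]
```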
(b) [10 points] (Autograder) Use your derivation above to complete the code in the svm.py file.
(c) [4 points] In the notebook, you will implement drawing a decision boundary on a simple dataset and identify support vectors from the solutions to the optimization problems. Include the test performance on each dataset and the generated visualizations of the CVXOPT SVMs (generated in svm.ipynb) in your submission writeup.

5 [27 points] Neural Network Layer Implementation

In this problem, you will get the opportunity to implement various neural network layers. $X$ and $Y$ will represent the input and output of a layer, respectively. For this problem, we use row-major notation (i.e., each row of $X$ and $Y$ corresponds to one data sample) to be compatible with the multi-dimensional array structure of NumPy. Furthermore, $L$ is the scalar-valued loss function of $Y$.

For the Fully-Connected Layer and ReLU, you should include your derivation of the gradient in your written
solution.
Important Note: In this question, any indices involved (i, j, etc.) in the mathematical notation start
from 1, and not 0. In your code, however, you should use 0-based indexing (standard NumPy notation).
(a) [8 points] Fully-Connected Layer

In this question, you will be deriving the gradient of a fully-connected layer. For this problem, consider $X \in \mathbb{R}^{N \times D_{\text{in}}}$, where $N$ is the number of samples in a batch. Additionally, consider a dense layer with weight parameters $W \in \mathbb{R}^{D_{\text{in}} \times D_{\text{out}}}$ and bias parameters $b \in \mathbb{R}^{1 \times D_{\text{out}}}$.

As we saw in the lecture, we can compute the output of the layer through a simple matrix multiply operation. Concretely, this can be seen as $Y = XW + B$. In this equation, we denote the output matrix as $Y \in \mathbb{R}^{N \times D_{\text{out}}}$ and the bias matrix as $B \in \mathbb{R}^{N \times D_{\text{out}}}$, where each row of $B$ is the vector $b$.

(Please
note that we are using row-vector notation here to make it compatible with NumPy/PyTorch, but the
lecture slides used column-vector notation, which is standard notation for most ML methods. We hope
that the distinction is clear from the context.)

Please note that the matrix multiply operation stated above is generally useful for computing batch outputs (i.e., simultaneously computing the output of all $N$ samples in the batch). However, for the purposes of computing the gradient, you might first want to consider the single-sample output operation of this fully-connected layer, which can be stated in row-major notation as $y^{(n)} = x^{(n)} W + b$, where $y^{(n)} \in \mathbb{R}^{1 \times D_{\text{out}}}$ and $x^{(n)} \in \mathbb{R}^{1 \times D_{\text{in}}}$ for $1 \le n \le N$. Here, the index "$(n)$" in the superscript denotes the $n$-th example, not the layer index, and $X^{(n)}_i = X_{n,i}$ denotes the $i$-th dimension of the $n$-th example $x^{(n)}$.
Note: it might be fruitful for you to consider this expression as a summation and then calculate the
gradient for individual parameters for a single-example case. After this, you can try to extend it to a
vector/matrix format for the involved parameters.

Now, compute the partial derivatives (in matrix form) $\frac{\partial L}{\partial W} \triangleq \nabla_W L \in \mathbb{R}^{D_{\text{in}} \times D_{\text{out}}}$, $\frac{\partial L}{\partial b} \triangleq \nabla_b L \in \mathbb{R}^{1 \times D_{\text{out}}}$, and $\frac{\partial L}{\partial X} \triangleq \nabla_X L \in \mathbb{R}^{N \times D_{\text{in}}}$ in terms of $\frac{\partial L}{\partial Y} \triangleq \nabla_Y L \in \mathbb{R}^{N \times D_{\text{out}}}$ (here $\triangleq$ means "equal to by definition"). For technical correctness, you might want to start by writing the gradient in non-vectorized form involving $\frac{\partial L}{\partial Y^{(n)}_j} \in \mathbb{R}$, where $\frac{\partial L}{\partial Y^{(n)}_j}$ is the gradient with respect to the $j$-th element of the $n$-th sample in $Y$, with $1 \le n \le N$ and $1 \le j \le D_{\text{out}}$.
Hint for $\frac{\partial L}{\partial W}$: Please note that we use "matrix/vector" notation $\frac{\partial L}{\partial W}$ to denote the matrix in $\mathbb{R}^{D_{\text{in}} \times D_{\text{out}}}$ whose element in the $i$-th row and $j$-th column is $\frac{\partial L}{\partial W_{i,j}}$, where $W_{i,j}$ is the element in the $i$-th row and $j$-th column of $W$, with $1 \le i \le D_{\text{in}}$ and $1 \le j \le D_{\text{out}}$. Here, you may want to calculate the gradient $\frac{\partial L}{\partial W_{i,j}}$ using the formula
\[
\frac{\partial L}{\partial W_{i,j}} = \sum_{n=1}^{N} \frac{\partial L}{\partial Y^{(n)}_j} \frac{\partial Y^{(n)}_j}{\partial W_{i,j}}.
\]

Hint for $\frac{\partial L}{\partial b}$: Please note that $\frac{\partial L}{\partial b}$ is a vector in $\mathbb{R}^{1 \times D_{\text{out}}}$. You may want to start with the formula
\[
\frac{\partial L}{\partial b_j} = \sum_{n=1}^{N} \sum_{m=1}^{D_{\text{out}}} \frac{\partial L}{\partial Y^{(n)}_m} \frac{\partial Y^{(n)}_m}{\partial b_j}
\]
and then move on to derive $\frac{\partial L}{\partial b}$.

Hint for $\frac{\partial L}{\partial X}$: Please note that $\frac{\partial L}{\partial X}$ is a matrix in $\mathbb{R}^{N \times D_{\text{in}}}$. In order to derive this gradient, you can start with the formula
\[
\frac{\partial L}{\partial X^{(n)}_i} = \sum_{n'=1}^{N} \sum_{m=1}^{D_{\text{out}}} \frac{\partial L}{\partial Y^{(n')}_m} \frac{\partial Y^{(n')}_m}{\partial X^{(n)}_i}.
\]
Following this, you can say that
\[
\frac{\partial L}{\partial X^{(n)}_i} = \sum_{m=1}^{D_{\text{out}}} \frac{\partial L}{\partial Y^{(n)}_m} \frac{\partial Y^{(n)}_m}{\partial X^{(n)}_i}
\]
because the $n$-th sample in $X$ is only related to the $n$-th sample in $Y$, and $\frac{\partial Y^{(n')}_m}{\partial X^{(n)}_i} = 0$ for all $n' \ne n$.
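Once the hints above are worked into matrix form, a vectorized NumPy sketch of the forward and backward passes looks like the following. Treat it as a cross-check for your own derivation rather than the assignment's reference implementation; dY stands for $\frac{\partial L}{\partial Y}$.

```python
import numpy as np

def fc_forward(X, W, b):
    """Forward pass of the fully-connected layer: Y = XW + B.

    b (1 x Dout) broadcasts across the N rows of XW, playing the role of B."""
    return X @ W + b

def fc_backward(dY, X, W):
    """Backward pass given upstream gradient dY = dL/dY of shape N x Dout."""
    dW = X.T @ dY                        # Din x Dout; sums over the N samples
    db = dY.sum(axis=0, keepdims=True)   # 1 x Dout
    dX = dY @ W.T                        # N x Din
    return dW, db, dX
```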
(b) [3 points] ReLU

Let $X \in \mathbb{R}^{N \times D}$ be a 2-D array and $Y = \mathrm{ReLU}(X)$. In this case, ReLU is applied to $X$ in an element-wise fashion. For an element $x \in X$, the corresponding output is $y = \mathrm{ReLU}(x) = \max(0, x)$. You can observe that $Y$ will have the same shape as $X$. Express $\frac{\partial L}{\partial X}$ in terms of $\frac{\partial L}{\partial Y}$, where $\frac{\partial L}{\partial X}$ has the same shape as $X$.

Hint: you may need to use the element-wise product to express $\frac{\partial L}{\partial X}$. Please feel free to introduce the indicator function $\mathbb{I}[\cdot]$.
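A corresponding one-line NumPy sketch (again a cross-check, adopting the earlier convention that the derivative of $\max(0, x)$ at exactly zero is taken to be zero):

```python
def relu_backward(dY, X):
    """dL/dX = dL/dY (element-wise product) I[X > 0]."""
    return dY * (X > 0)
```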

(c) [14 points] (Autograder) Now, we will implement the corresponding forward and backward passes in two_layer_net.py. More importantly, please ensure that you use vectorization to make your layer implementations as efficient as possible. You should use two_layer_net.ipynb to help you check the correctness of your implementation. Note that for each layer, the IPython notebook only checks the result for one specific example; passing these tests does not necessarily mean that your implementation is 100% correct. Once you have implemented the layers properly, please implement a two-fully-connected-layer classifier with a softmax loss in two_layer_net.py and test it on the MNIST dataset, following the guide in two_layer_net.ipynb. All your implementations should go in two_layer_net.py.
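Since the notebook only checks one specific example, a central-difference gradient check is a cheap extra safeguard. Below is a minimal sketch built on the hypothetical fc_forward/fc_backward functions sketched in part (a); it fabricates an upstream gradient dY and defines the scalar loss $L = \sum Y \odot dY$ so that $\partial L / \partial Y$ equals dY exactly.

```python
def grad_check_fc(eps=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(4, 3))
    W = rng.normal(size=(3, 5))
    b = rng.normal(size=(1, 5))
    dY = rng.normal(size=(4, 5))   # fabricated upstream gradient

    # Analytic gradient from the backward pass.
    dW, _, _ = fc_backward(dY, X, W)

    # Numeric gradient of L = sum(Y * dY) w.r.t. one weight entry.
    i, j = 1, 2
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[i, j] += eps
    W_minus[i, j] -= eps
    num = (np.sum(fc_forward(X, W_plus, b) * dY)
           - np.sum(fc_forward(X, W_minus, b) * dY)) / (2 * eps)
    print("analytic:", dW[i, j], "numeric:", num)  # should agree closely
```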
(d) [1 point] Please report the training loss and the train/test accuracy plot for the MNIST dataset, stored as two_layer_net.png from two_layer_net.ipynb, in your writeup.
(e) [1 point] Please report the accuracy obtained on the MNIST test set in your writeup.