
## 1 [21 + 3 points] Direct Construction of Valid Kernels

In class, we saw that by choosing a kernel $k(x, z) = \phi(x)^\top \phi(z)$, we can implicitly map data to a high-dimensional space and have the SVM algorithm work in that space. One way to generate kernels is to explicitly define the mapping $\phi$ to a higher-dimensional space, and then work out the corresponding $k$. However, in this question, we are interested in the direct construction of kernels.

Suppose we have a function $k(x, z)$ that gives an appropriate similarity measure (similar to an inner product) for our learning problem, and consider plugging $k$ into a kernelized algorithm (like SVM) as the kernel function. In order for $k(x, z)$ to be a valid kernel, it must correspond to an inner product in some higher-dimensional space resulting from some feature mapping $\phi$. In addition, Mercer's theorem states that $k(x, z)$ is a (Mercer) kernel if and only if for any finite set $x^{(1)}, \ldots, x^{(N)}$, the matrix $K \in \mathbb{R}^{N \times N}$ given by $K_{ij} = k\left(x^{(i)}, x^{(j)}\right)$ is symmetric and positive semi-definite.

Now here comes the question: let $k_1, k_2$ be kernels over $\mathbb{R}^D \times \mathbb{R}^D$, let $a \in \mathbb{R}^+$ be a positive real number, let $f : \mathbb{R}^D \to \mathbb{R}$ be a real-valued function, let $\phi : \mathbb{R}^D \to \mathbb{R}^M$ be a function mapping from $\mathbb{R}^D$ to $\mathbb{R}^M$, let $k_3$ be a kernel over $\mathbb{R}^M \times \mathbb{R}^M$, and let $p : \mathbb{R} \to \mathbb{R}$ be a polynomial with positive coefficients.

For each of the functions $k$ below, state whether it is necessarily a kernel. If you think it is a kernel, please prove it; if you think it is not, please prove it or give a counterexample. You may use either definition (inner product of a feature map, or PSD kernel matrix) when proving the validity of a kernel. If you have already proved a property in one sub-question (e.g., in Q1.(a)), it is okay to use it in other sub-questions, e.g., Q1.(f), by citing that you proved it in Q1.(a); please make sure to provide each proof at least once. However, you are NOT allowed to use the 'results' from the 'Constructing kernels' slide without proving them yourself.

(a) [2 points] $k(x, z) = k_1(x, z) + k_2(x, z)$

(b) [2 points] $k(x, z) = k_1(x, z) - k_2(x, z)$

(c) [2 points] $k(x, z) = -a\,k_1(x, z)$

(d) [2 points] $k(x, z) = f(x)f(z)$

(e) [3 points] $k(x, z) = k_3(\phi(x), \phi(z))$

(f) [3 points] $k(x, z) = p\left(k_1(x, z)\right)$

(g) [3 points] $k(x, z) = k_1(x, z)\,k_2(x, z)$

(h) [4 points] For this sub-problem, you are not required to check whether $k$ is a valid kernel. Instead, given the kernel $k(x, z) = (x^\top z + 1)^2$, find a feature map $\phi$ associated with $k(x, z)$ such that $k(x, z) = \phi(x)^\top \phi(z)$. You may assume $D = 2$ for this subproblem.

(i) [3 points, extra credit] Prove that the Gaussian kernel, $k(x, z) = \exp\left(-\|x - z\|^2 / 2\sigma^2\right)$, can be expressed as $k(x, z) = \phi(x)^\top \phi(z)$, where $\phi(\cdot)$ is an infinite-dimensional vector. Specifically, find the explicit closed form of an infinite-dimensional feature vector $\phi$. (Note: you cannot directly apply Mercer's theorem here, as $\phi$ is an infinite-dimensional vector.)

[Hint 1: $\|x - z\|^2 = x^\top x + z^\top z - 2x^\top z$. It might be useful to consider the power series $\exp(x) = \sum_{n=0}^{\infty} \frac{1}{n!} x^n$.]

[Hint 2: You might find the formula on slide 16 of Lecture 8 helpful.]

[Hint 3: Each element of $\phi(x)$ can be written as an explicit formula over $x$ without any limits or infinite summations.]
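As a quick sanity check while working through these sub-questions (never a substitute for a proof, but a single failing sample *is* a valid counterexample), you can build the kernel matrix $K$ on a small random sample and test it numerically for symmetry and positive semi-definiteness. The sketch below is my own illustration, not part of the assignment; it uses the Gaussian kernel from part (i) with $\sigma = 1$:

```python
import numpy as np

def kernel_matrix(k, X):
    """Build K with K[i, j] = k(x_i, x_j) for rows x_i of X."""
    N = X.shape[0]
    return np.array([[k(X[i], X[j]) for j in range(N)] for i in range(N)])

def is_psd(K, tol=1e-8):
    """Symmetric and all eigenvalues >= -tol (small slack for round-off)."""
    return bool(np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

gaussian = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)  # sigma = 1
print(is_psd(kernel_matrix(gaussian, X)))   # True: passes the Mercer check

neg = lambda x, z: -gaussian(x, z)          # like part (c) with a = 1
print(is_psd(kernel_matrix(neg, X)))        # False: fails, so not a kernel
```

A `False` here proves invalidity on that sample; a `True` only fails to find a counterexample, so the "is a kernel" direction still needs a proof.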

## 2 [14 points] Implementing Soft Margin SVM by Optimizing Primal Objective

The Support Vector Machine (SVM) is a discriminative model for classification. Although it is possible to develop SVMs that perform $K$-class classification, we will restrict ourselves to binary classification in this question, where the class label is either $+1$ (positive) or $-1$ (negative). SVM is not a probabilistic algorithm; in other words, in its usual construction, it does not optimize a probability measure as a likelihood. Instead, SVM tries to find the "best" hyperplane, one that maximally separates the positive class from the negative class.

Recall that the objective function for maximizing the soft margin is equivalent to the following minimization problem:

$$
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi^{(i)} \\
\text{subject to}\quad & y^{(i)}\left(w^\top\phi(x^{(i)}) + b\right) \ge 1 - \xi^{(i)}, \quad \forall i = 1,\ldots,N \\
& \xi^{(i)} \ge 0, \quad \forall i = 1,\ldots,N
\end{aligned}
\tag{1}
$$

The above is known as the primal objective of SVM. Notice the two constraints on the slack variables $\xi^{(i)}$: each $\xi^{(i)}$ must satisfy both conditions, while the sum of the $\xi^{(i)}$'s times $C$ is minimized.

The constrained minimization is equivalent to the following minimization involving the hinge loss term:

$$
\min_{w,\,b} E(w, b), \qquad E(w, b) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\max\left(0,\ 1 - y^{(i)}\left(w^\top\phi(x^{(i)}) + b\right)\right)
\tag{2}
$$

You will be working with minimization of the above objective in this problem.

(a) [5 points] Find the derivatives of the loss function $E(w, b)$ with respect to the parameters $w, b$. Show that:

$$
\nabla_w E(w, b) = w - C\sum_{i=1}^{N}\mathbb{I}\left[y^{(i)}\left(w^\top\phi(x^{(i)}) + b\right) < 1\right]y^{(i)}\phi(x^{(i)})
\tag{3}
$$

$$
\frac{\partial}{\partial b}E(w, b) = -C\sum_{i=1}^{N}\mathbb{I}\left[y^{(i)}\left(w^\top\phi(x^{(i)}) + b\right) < 1\right]y^{(i)},
\tag{4}
$$

where $\mathbb{I}[\cdot]$ is the indicator function. If $f(x) = \max(0, x)$, then assume that

$$
\frac{\partial f}{\partial x} = \begin{cases} 1 & x > 0 \\ 0 & \text{otherwise} \end{cases}
$$

(b) [7 points] (Autograder) Implement the SVM algorithm using batch gradient descent, following part 1 of soft_margin_svm.ipynb. In previous assignments, you have implemented gradient descent while minimizing over one variable. Minimizing over two variables $(w, b)$ is no different: both gradients are computed from the current parameter values, and then the parameters are simultaneously updated.³ Example pseudocode for optimizing $w$ and $b$ is given by Algorithm 1 below.

³ Note: it is a common mistake to implement gradient descent over multiple parameters by updating the first parameter, then computing the derivative w.r.t. the second parameter using the updated value of the first. In fact, updating one parameter and then computing the gradient of the second using the updated value of the first is a different optimization algorithm, known as Coordinate Descent.


**Algorithm 1:** SVM Batch Gradient Descent

```
w* ← 0
b* ← 0
for j = 1 to NumEpochs do
    wgrad ← ∇_w E(w*, b*)
    bgrad ← (∂/∂b) E(w*, b*)
    w* ← w* − η(j) · wgrad
    b* ← b* − η(j) · bgrad
end
return w*, b*
```

Throughout this question, we will set $C$ from the previous question to $C = 5$, NumEpochs $= 100$, and use a constant learning rate, i.e., $\eta(j) = 10^{-3}$.

(Remarks: In this problem, we only ask you to implement batch gradient descent for the primal SVM, but it is quite straightforward to implement stochastic gradient descent in a very similar way, although there you may need to decay the learning rate and set the initial learning rate properly. Unlike stochastic gradient descent, batch gradient descent does not need a decaying learning rate for convergence.)
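For concreteness, here is a minimal pure-NumPy sketch of Algorithm 1 using the gradients from Equations (3) and (4), run on a tiny, well-separated toy set of my own with the identity feature map $\phi(x) = x$. The assignment's actual data, interface, and grading harness live in the notebook; this is only an illustration of the update loop:

```python
import numpy as np

def svm_gd(X, y, C=5.0, lr=1e-3, epochs=100):
    """Batch gradient descent on the primal objective (Eq. 2), phi = identity."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)        # y^(i) (w^T x^(i) + b)
        viol = margins < 1               # indicator from Eqs. (3)-(4)
        wgrad = w - C * (viol * y) @ X   # Eq. (3)
        bgrad = -C * np.sum(viol * y)    # Eq. (4)
        # simultaneous update (not coordinate descent!)
        w, b = w - lr * wgrad, b - lr * bgrad
    return w, b

# Toy data: class +1 near (2, 2), class -1 near (-2, -2)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)

w, b = svm_gd(X, y)
acc = np.mean(np.sign(X @ w + b) == y)
print(acc)  # this well-separated toy set should reach 1.0
```

Note that both gradients are evaluated at the *current* $(w, b)$ before either parameter is touched, exactly as Algorithm 1 prescribes.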

(c) [2 points] Run gradient descent over the training data five times, once for each value of NumEpochs $\in \{1, 3, 10, 30, 100\}$. For each run, please report your trained parameters $(w, b)$ and the test classification accuracy in your writeup.


## 3 [20 points] Asymmetric Cost SVM

Consider applying an SVM to a supervised learning problem where the cost of a false positive (mistakenly predicting $+1$ when the label is $-1$) is different from the cost of a false negative (predicting $-1$ when the label is $+1$). The asymmetric cost SVM models these asymmetric costs by posing the following optimization problem:

$$
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}w^\top w + C_0\sum_{i:\,y^{(i)}=-1}\xi^{(i)} + C_1\sum_{i:\,y^{(i)}=1}\xi^{(i)} \\
\text{s.t.}\quad & y^{(i)}\left(w^\top\phi(x^{(i)}) + b\right) \ge 1 - \xi^{(i)}, \quad \forall i = 1,\ldots,N \\
& \xi^{(i)} \ge 0, \quad \forall i = 1,\ldots,N
\end{aligned}
$$

Here, $C_0$ is the cost of a false positive and $C_1$ is the cost of a false negative. (Both $C_0$ and $C_1$ are fixed, known constants.)

(a) [4 points] We will find the dual optimization problem. First, write down the Lagrangian. Use $\alpha^{(i)}$ and $\mu^{(i)}$ to denote the Lagrange multipliers corresponding to the two constraints ($\alpha^{(i)}$ for the first constraint, and $\mu^{(i)}$ for the second constraint, on the slack variables) in the primal optimization problem above.

$$
L(w, b, \xi, \alpha, \mu) =
$$

(b) [6 points] Calculate the following derivatives with respect to the primal variables:

$$
\nabla_w L(w, b, \xi, \alpha, \mu) =
$$

$$
\frac{\partial L(w, b, \xi, \alpha, \mu)}{\partial b} =
$$

$$
\nabla_{\xi^{(i)}} L(w, b, \xi, \alpha, \mu) =
$$

[Hint] You can define $C^{(i)}$ as $C^{(i)} = C_0\,\mathbb{I}[y^{(i)} = -1] + C_1\,\mathbb{I}[y^{(i)} = 1]$.

(c) [10 points] Find the dual optimization problem. You should write it down in the following form. Try to simplify your answer as much as possible; in particular, to obtain full credit, the Lagrange multipliers $\mu^{(i)}$ should not appear in your simplified form of the dual.

$$
\max_\alpha\ W(\alpha) =
$$

$$
\text{s.t.}
$$


## 4 [18 points] SVMs with Convex Optimization

In this problem you will have practice training SVMs on toy data. You will learn how to use the popular

Scikit-Learn library’s SVM implementation and how to use the convex optimization library CVXOPT to

train an SVM by directly solving the dual optimization problem.

(a) [4 points] Recall from lecture that the (kernelized) dual optimization problem of SVMs can be written as follows:

$$
\begin{aligned}
\underset{\alpha}{\text{maximize}}\quad & \sum_{n=1}^{N}\alpha^{(n)} - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha^{(n)}\alpha^{(m)}y^{(n)}y^{(m)}k(x^{(n)}, x^{(m)}) \\
\text{subject to}\quad & 0 \le \alpha^{(n)} \le C \\
& \sum_{n=1}^{N}\alpha^{(n)}y^{(n)} = 0
\end{aligned}
\tag{5}
$$

where there are $N$ training examples, $x^{(i)} \in \mathbb{R}^D$, $y^{(i)} \in \{-1, 1\}$, $k$ is a kernel function $k : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$, $\alpha \in \mathbb{R}^N$, and $C \in \mathbb{R}$.

To solve this problem using CVXOPT, we will use the quadratic programming solver cvxopt.solvers.qp. Given matrices $P$, $G$, $A$ and vectors $q$, $h$, $b$, CVXOPT can solve the following optimization problem (a quadratic program):

$$
\begin{aligned}
\underset{v}{\text{minimize}}\quad & \frac{1}{2}v^\top P v + q^\top v \\
\text{subject to}\quad & Gv \preceq h \\
& Av = b
\end{aligned}
\tag{6}
$$

Find values for matrices $P$, $G$, $A$ and vectors $q$, $h$, $b$ in terms of $x^{(i)}$, $y^{(i)}$, $k(x^{(i)}, x^{(j)})$, and $C$ such that the solution of the SVM dual problem ($\alpha$ in Equation 5) is equal to the solution of the CVXOPT problem ($v$ in Equation 6).

Hint 1. A maximization problem has the same solution as a minimization problem after a sign change: $\arg\min_x f(x) = \arg\max_x -f(x)$.

Hint 2. $0 \le \alpha^{(n)} \le C$ can be separated into two constraints: $-\alpha^{(n)} \le 0$ and $\alpha^{(n)} \le C$.
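Before wiring your $(P, q, G, h, A, b)$ construction into cvxopt.solvers.qp, it can help to see what the standard form in Equation 6 is actually minimizing. The toy sketch below is my own example, unrelated to the SVM data: it solves a tiny QP with only the box constraint from Hint 2 (the equality constraint $Av = b$ is omitted here) by projected gradient descent, stepping along $-(Pv + q)$ and clipping back into the box. In the assignment you pass the same kind of ingredients to cvxopt.solvers.qp rather than rolling your own solver:

```python
import numpy as np

# Toy QP: minimize (1/2) v^T P v + q^T v  subject to  0 <= v <= C.
# As in Hint 2, the box can be written Gv <= h with G = [-I; I],
# h = [0; C*1]; here we enforce it directly via clipping.
P = np.array([[2.0, 0.0],
              [0.0, 2.0]])
q = np.array([-2.0, -4.0])
C = 1.0

v = np.zeros(2)
for _ in range(1000):
    v = v - 0.1 * (P @ v + q)   # gradient step on the quadratic objective
    v = np.clip(v, 0.0, C)      # project back into the box [0, C]^2
print(v)  # -> close to [1., 1.]: the unconstrained minimum (1, 2) hits the C bound
```

The unconstrained minimizer solves $Pv = -q$, giving $(1, 2)$; the box constraint pins the second coordinate at $C = 1$, which is exactly the kind of clamping the dual variables $\alpha^{(n)}$ experience at the $C$ bound.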

(b) [10 points] (Autograder) Use your derivation above to complete the code in the svm.py file.

(c) [4 points] In the notebook, you will implement drawing a decision boundary on a simple dataset and identify the support vectors from the solutions to the optimization problems. Include the test performance on each dataset and the generated visualizations of the CVXOPT SVMs in your submission writeup (generated in svm.ipynb).


## 5 [27 points] Neural Network Layer Implementation

In this problem, you will get the opportunity to implement various neural network layers. $X$ and $Y$ represent the input and the output of the layers, respectively. For this problem, we use row-major notation (i.e., each row of $X$ and $Y$ corresponds to one data sample) to be compatible with the multi-dimensional array structure of NumPy. Furthermore, $L$ is the scalar-valued loss function of $Y$.

Important Note: In this question, any indices involved (i, j, etc.) in the mathematical notation start

from 1, and not 0. In your code, however, you should use 0-based indexing (standard NumPy notation).

(a) [8 points] Fully-Connected Layer

In this question, you will derive the gradient of a fully-connected layer. For this problem, consider $X \in \mathbb{R}^{N \times D_{in}}$, where $N$ is the number of samples in a batch. Additionally, consider a dense layer with weight parameters $W \in \mathbb{R}^{D_{in} \times D_{out}}$ and bias parameters $b \in \mathbb{R}^{1 \times D_{out}}$.

As we saw in lecture, we can compute the output of the layer through a simple matrix multiply operation. Concretely, this can be seen as $Y = XW + B$. In this equation, we denote the output matrix as $Y \in \mathbb{R}^{N \times D_{out}}$ and the bias matrix as $B \in \mathbb{R}^{N \times D_{out}}$, where each row of $B$ is the vector $b$. (Please note that we are using row-vector notation here to make it compatible with NumPy/PyTorch, while the lecture slides used column-vector notation, which is standard for most ML methods. We hope the distinction is clear from the context.)

Please note that the matrix multiply operation stated above is generally useful for computing batch outputs (i.e., simultaneously computing the output of $N$ samples in the batch). However, for the purposes of computing the gradient, you might first want to consider the single-sample operation of this fully-connected layer, which can be stated in row-major notation as $y^{(n)} = x^{(n)}W + b$, where $y^{(n)} \in \mathbb{R}^{1 \times D_{out}}$ and $x^{(n)} \in \mathbb{R}^{1 \times D_{in}}$ for $1 \le n \le N$. Here, the index "$(n)$" in the superscript denotes the $n$-th example, not a layer index, and $X^{(n)}_i = X_{n,i}$ denotes the $i$-th dimension of the $n$-th example $x^{(n)}$.

Note: it might be fruitful to write this expression as a summation and then calculate the gradient for individual parameters in the single-example case. After this, you can try to extend it to a vector/matrix format for the involved parameters.

Now, compute the partial derivatives (in matrix form) $\frac{\partial L}{\partial W} \triangleq \nabla_W L \in \mathbb{R}^{D_{in} \times D_{out}}$, $\frac{\partial L}{\partial b} \triangleq \nabla_b L \in \mathbb{R}^{1 \times D_{out}}$, and $\frac{\partial L}{\partial X} \triangleq \nabla_X L \in \mathbb{R}^{N \times D_{in}}$ in terms of $\frac{\partial L}{\partial Y} \triangleq \nabla_Y L \in \mathbb{R}^{N \times D_{out}}$.⁴ For technical correctness, you might want to start by writing the gradient in non-vectorized form involving $\frac{\partial L}{\partial Y^{(n)}_j} \in \mathbb{R}$, where $\frac{\partial L}{\partial Y^{(n)}_j}$ is the gradient with respect to the $j$-th element of the $n$-th sample in $Y$, with $1 \le n \le N$ and $1 \le j \le D_{out}$.

**Hint for $\frac{\partial L}{\partial W}$.** Please note that we use the "matrix/vector" notation $\frac{\partial L}{\partial W}$ to denote the matrix in $\mathbb{R}^{D_{in} \times D_{out}}$ whose element in the $i$-th row and $j$-th column is $\frac{\partial L}{\partial W_{i,j}}$, where $W_{i,j}$ is the $i$-th row, $j$-th column element of $W$, with $1 \le i \le D_{in}$ and $1 \le j \le D_{out}$. Here, you may want to calculate the gradient $\frac{\partial L}{\partial W_{i,j}}$ using the formula

$$
\frac{\partial L}{\partial W_{i,j}} = \sum_{n=1}^{N}\frac{\partial L}{\partial Y^{(n)}_j}\frac{\partial Y^{(n)}_j}{\partial W_{i,j}}.
$$

**Hint for $\frac{\partial L}{\partial b}$.** Please note that $\frac{\partial L}{\partial b}$ is a vector in $\mathbb{R}^{1 \times D_{out}}$. You may want to start from the formula

$$
\frac{\partial L}{\partial b_j} = \sum_{n=1}^{N}\sum_{m=1}^{D_{out}}\frac{\partial L}{\partial Y^{(n)}_m}\frac{\partial Y^{(n)}_m}{\partial b_j}
$$

and then move on to derive $\frac{\partial L}{\partial b}$.

⁴ $\triangleq$ means 'equal to by definition'.

**Hint for $\frac{\partial L}{\partial X}$.** Please note that $\frac{\partial L}{\partial X}$ is a matrix in $\mathbb{R}^{N \times D_{in}}$. In order to derive this gradient, you can start from the formula

$$
\frac{\partial L}{\partial X^{(n)}_i} = \sum_{n'=1}^{N}\sum_{m=1}^{D_{out}}\frac{\partial L}{\partial Y^{(n')}_m}\frac{\partial Y^{(n')}_m}{\partial X^{(n)}_i}.
$$

Following this, you can say that

$$
\frac{\partial L}{\partial X^{(n)}_i} = \sum_{m=1}^{D_{out}}\frac{\partial L}{\partial Y^{(n)}_m}\frac{\partial Y^{(n)}_m}{\partial X^{(n)}_i},
$$

because the $n$-th sample in $X$ is only related to the $n$-th sample in $Y$, and $\frac{\partial Y^{(n')}_m}{\partial X^{(n)}_i} = 0$ for all $n' \ne n$.

(b) [3 points] ReLU

Let $X \in \mathbb{R}^{N \times D}$ be a 2-D array and $Y = \mathrm{ReLU}(X)$. In this case, ReLU is applied to $X$ element-wise: for an element $x \in X$, the corresponding output is $y = \mathrm{ReLU}(x) = \max(0, x)$. You can observe that $Y$ has the same shape as $X$. Express $\frac{\partial L}{\partial X}$ in terms of $\frac{\partial L}{\partial Y}$, where $\frac{\partial L}{\partial X}$ has the same shape as $X$.

Hint: you may need to use the element-wise product to express $\frac{\partial L}{\partial X}$. Please feel free to introduce the indicator function $\mathbb{I}[\cdot]$.

(c) [14 points] (Autograder) Now, we will implement the corresponding forward and backward passes in two_layer_net.py. Importantly, please ensure that you use vectorization to make your layer implementations as efficient as possible. You should use two_layer_net.ipynb to help check the correctness of your implementation. Note that for each layer, the notebook only checks the result for one specific example; passing these tests does not necessarily mean that your implementation is 100% correct. Once you have implemented the layers properly, please implement a classifier with two fully-connected layers and a Softmax loss in two_layer_net.py and test it on the MNIST dataset, following the guide in two_layer_net.ipynb. All your implementations should go in two_layer_net.py.
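Because the notebook only checks one example per layer, running a numeric gradient check of your own is a good extra safeguard. The sketch below is a generic centered-difference checker of my own, not code from two_layer_net.py; it is demonstrated on a toy loss $L(X) = \sum X^2$ (analytic gradient $2X$) rather than on any of the layers you are asked to derive:

```python
import numpy as np

def numeric_grad(f, X, eps=1e-6):
    """Centered finite-difference gradient of a scalar function f at array X."""
    grad = np.zeros_like(X)
    it = np.nditer(X, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        old = X[idx]
        X[idx] = old + eps
        fp = f(X)                       # f(X + eps e_idx)
        X[idx] = old - eps
        fm = f(X)                       # f(X - eps e_idx)
        X[idx] = old                    # restore the original value
        grad[idx] = (fp - fm) / (2 * eps)
        it.iternext()
    return grad

# Demo on L(X) = sum(X ** 2), whose analytic gradient is 2 * X.
# Swap in your layer's forward pass and backward output when checking.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
analytic = 2 * X
numeric = numeric_grad(lambda A: np.sum(A ** 2), X)
print(np.max(np.abs(analytic - numeric)))  # should be tiny (round-off level)
```

If the maximum absolute difference between your analytic backward pass and the numeric gradient is not small (roughly below 1e-6 for well-scaled inputs), there is likely a bug in the derivation or the vectorization.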

(d) [1 point] Please report the training loss and the train/test accuracy plot for the MNIST dataset, stored as two_layer_net.png from two_layer_net.ipynb, in your writeup.

(e) [1 point] Please report the accuracy obtained on the MNIST test set in your writeup.
