
## 1 [25 points] Logistic Regression

Consider the log-likelihood function for logistic regression:

$$\ell(w) = \sum_{i=1}^{N} \left[ y^{(i)} \log h(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h(x^{(i)})\right) \right],$$

where $h(x) = \sigma(w^\top x) = \dfrac{1}{1 + \exp(-w^\top x)}$.
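For instance, the hypothesis $h(x)$ could be computed as follows (a NumPy sketch; the function names are illustrative, not part of the assignment):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def h(w, x):
    # h(x) = sigma(w^T x)
    return sigmoid(np.dot(w, x))
```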

(a) [5 points] Find the Hessian H of ℓ(w).

(b) [6 points] Show that H is negative semi-definite and thus ℓ is concave and has no local maxima other than the global one. That is, show that

$$z^\top H z \leq 0$$

for any vector $z$. [Hint: You might want to start by showing the fact that $\sum_i \sum_j z_i x_i x_j z_j = (x^\top z)^2$. Note that $(x^\top z)^2 \geq 0$.]

(c) [2 points] Using the H you calculated in part (a), write down the update rule implied by Newton's method for optimizing ℓ(w). [Hint: it could be a single-line equation: w = YOUR ANSWER.]

(d) [8 points] (Autograder) Now use the update rule in (c) (and not a library function) to implement Newton's method and apply it to a binary classification problem, following the guide in logistic_regression.ipynb.

Your ipynb file SHOULD include all the outputs.
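Newton's method itself (independent of the ℓ-specific gradient and Hessian you derive in parts (a) and (c)) can be sketched as follows; the toy quadratic objective and all names here are illustrative, not the autograder interface:

```python
import numpy as np

def newton_maximize(grad, hess, w0, iters=10):
    # Generic Newton update: w <- w - H^{-1} grad(w)
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w = w - np.linalg.solve(hess(w), grad(w))
    return w

# Toy concave objective f(w) = -(w_0 - 1)^2 - (w_1 + 2)^2, maximized at (1, -2)
grad = lambda w: np.array([-2.0 * (w[0] - 1.0), -2.0 * (w[1] + 2.0)])
hess = lambda w: np.array([[-2.0, 0.0], [0.0, -2.0]])
w_star = newton_maximize(grad, hess, [0.0, 0.0])
```

On a quadratic objective a single Newton step lands exactly on the maximizer, which makes this toy a quick sanity check for an implementation.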

(e) [2 points] What are the coefficients w, including the intercept term, resulting from your code? Please

provide your answer in your writeup.

(f) [2 points] Please share the final plot from logistic_regression.ipynb in your writeup.


## 2 [27 points] Softmax Regression via Gradient Ascent

Gradient ascent is an algorithm used to find parameters that maximize a certain expression (contrary to

gradient descent, which is used to minimize an expression). For some function f(w), gradient ascent finds

w∗ = argmaxw f(w) according to the following pseudo-code:

Algorithm 1 Gradient Ascent

1: w∗ ← random
2: repeat
3:   w∗ ← w∗ + α∇w f(w∗)
4: until convergence
5: return w∗
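The pseudo-code above maps directly onto a few lines of Python; the quadratic toy objective below is only an illustration (in practice "until convergence" would check the change in w or in f):

```python
import numpy as np

def gradient_ascent(grad, w0, alpha=0.1, iters=500):
    # w <- w + alpha * grad(w), repeated until (approximate) convergence
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w = w + alpha * grad(w)
    return w

# Maximize f(w) = -(w - 3)^2; gradient is -2(w - 3), maximizer is w* = 3
w_star = gradient_ascent(lambda w: -2.0 * (w - 3.0), w0=0.0)
```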

Softmax regression is a multiclass classification algorithm. Given a labeled dataset $D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(N)}, y^{(N)})\}$, where $y^{(i)} \in \{1, 2, \ldots, K\}$ (total $K$ classes), softmax regression computes the probability that an example $x$ belongs to a class $k$:

$$p(y = k \mid x, w) = \frac{\exp(w_k^\top \phi(x))}{\sum_{j=1}^{K} \exp(w_j^\top \phi(x))}$$

where $w_k$ is a weight vector for class $k$. The above expression is over-parametrized, meaning that there is more than one unique $\{w_1, w_2, \ldots, w_K\}$ that gives identical probability measures for $p(y = k \mid x, w)$. A unique solution can be obtained using only $K-1$ weight vectors $w = \{w_1, w_2, \ldots, w_{K-1}\}$ and fixing $w_K = 0$:

$$p(y = k \mid x, w) = \frac{\exp(w_k^\top \phi(x))}{1 + \sum_{j=1}^{K-1} \exp(w_j^\top \phi(x))}; \quad \forall k \in \{1, 2, \ldots, K-1\}$$

$$p(y = K \mid x, w) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(w_j^\top \phi(x))}$$
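The (K−1)-weight-vector parametrization above can be computed as, for instance (a sketch; storing w_1, …, w_{K−1} as the rows of a matrix W is an assumption of this illustration):

```python
import numpy as np

def softmax_probs(W, phi):
    """Class probabilities under the K-1 parametrization (w_K = 0).

    W   : (K-1, d) array, rows are w_1, ..., w_{K-1}
    phi : (d,) feature vector phi(x)
    Returns a length-K probability vector.
    """
    scores = W @ phi                 # w_k^T phi(x) for k = 1..K-1
    scores = np.append(scores, 0.0)  # fixed w_K = 0 contributes exp(0) = 1
    scores -= scores.max()           # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum()
```

Subtracting the maximum score before exponentiating leaves the probabilities unchanged but avoids overflow for large $w_k^\top \phi(x)$.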

We define the likelihood of the $i$th training example, $p(y^{(i)} \mid x^{(i)}, w)$, as:

$$p(y^{(i)} \mid x^{(i)}, w) = \prod_{k=1}^{K} \left[ p(y^{(i)} = k \mid x^{(i)}, w) \right]^{\mathbb{I}\{y^{(i)} = k\}}$$

where $\mathbb{I}\{\cdot\}$ is the indicator function. We define the likelihood as:

$$L(w) = \prod_{i=1}^{N} p(y^{(i)} \mid x^{(i)}, w) = \prod_{i=1}^{N} \prod_{k=1}^{K} \left[ p(y^{(i)} = k \mid x^{(i)}, w) \right]^{\mathbb{I}\{y^{(i)} = k\}}$$

Finally, we define the log-likelihood as:

$$\ell(w) = \log L(w) = \sum_{i=1}^{N} \sum_{k=1}^{K} \log\left( \left[ p(y^{(i)} = k \mid x^{(i)}, w) \right]^{\mathbb{I}\{y^{(i)} = k\}} \right)$$

(a) [11 points] Derive the gradient ascent update rule for the log-likelihood of the training data. In other words, derive the expression $\nabla_{w_m} \ell(w)$ for $m = 1, \ldots, K-1$. Show that:

$$\nabla_{w_m} \ell(w) = \sum_{i=1}^{N} \phi(x^{(i)}) \left[ \mathbb{I}\{y^{(i)} = m\} - \frac{\exp(w_m^\top \phi(x^{(i)}))}{1 + \sum_{j=1}^{K-1} \exp(w_j^\top \phi(x^{(i)}))} \right]$$

$$= \sum_{i=1}^{N} \phi(x^{(i)}) \left[ \mathbb{I}\{y^{(i)} = m\} - p(y^{(i)} = m \mid x^{(i)}, w) \right]$$

Remember that in this notation, wk is a weight vector for class k, not the kth element of w.

[Hints: $\log a^b = b \log a$. Further, it helps to consider the two cases separately; a case for $y^{(i)} = k = m$, and another for $y^{(i)} = k \neq m$, which is equivalent to using the Kronecker delta $\delta_{km}$.]

(b) [14 points] (Autograder) Using the gradient computed in part (a), implement gradient ascent for softmax regression, following the guide in softmax_regression.ipynb. You are required to implement your code in softmax_regression.py. Make sure to include all the outputs in your softmax_regression.ipynb file. Recall that softmax regression classifies an example x as:

$$y = \operatorname*{argmax}_{y'} \, p(y' \mid x, w)$$
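One way to turn the gradient displayed in part (a) and the classification rule above into code is sketched below; treating Phi as an N×d matrix with rows $\phi(x^{(i)})$ and labels 1-indexed in {1, …, K} are assumptions of this illustration, not the required softmax_regression.py interface:

```python
import numpy as np

def class_probs(Phi, W):
    # p(y=k | x, w) for every example; W is (K-1, d), w_K = 0 is appended
    scores = np.hstack([Phi @ W.T, np.zeros((Phi.shape[0], 1))])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)       # shape (N, K)

def grad_wm(Phi, y, W, m):
    # sum_i phi(x^(i)) [ I{y^(i) = m} - p(y^(i) = m | x^(i), w) ]
    p = class_probs(Phi, W)
    indicator = (y == m).astype(float)
    return Phi.T @ (indicator - p[:, m - 1])

def predict(Phi, W):
    # y = argmax_k p(y=k | x, w); returns labels in {1..K}
    return class_probs(Phi, W).argmax(axis=1) + 1
```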

(c) [2 points] When you train your classifier on the training data in (b), what is the accuracy on the test

data? Please report your accuracy in your writeup.


## 3 [25 points] Gaussian Discriminant Analysis

Suppose we are given a dataset $\{(x^{(i)}, y^{(i)});\, i = 1, \ldots, N\}$ consisting of $N$ independent examples, where $x^{(i)} \in \mathbb{R}^M$ are $M$-dimensional vectors and $y^{(i)} \in \{0, 1\}$. We will model the joint distribution of $(x^{(i)}, y^{(i)})$ using:

$$p(y^{(i)}) = \phi^{y^{(i)}} (1 - \phi)^{1 - y^{(i)}}$$

$$p(x^{(i)} \mid y^{(i)} = 0) = \frac{1}{(2\pi)^{M/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu_0)^\top \Sigma^{-1} (x^{(i)} - \mu_0) \right)$$

$$p(x^{(i)} \mid y^{(i)} = 1) = \frac{1}{(2\pi)^{M/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu_1)^\top \Sigma^{-1} (x^{(i)} - \mu_1) \right)$$

Here, the parameters of our model are ϕ, Σ, µ0, and µ1. (Note that while there are two different mean

vectors µ0 and µ1, there is only one covariance matrix Σ.)
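For reference, the class-conditional Gaussian density above can be evaluated numerically as follows (a NumPy sketch; the function name is illustrative):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    # N(x; mu, Sigma) = exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu)) / ((2 pi)^{M/2} |Sigma|^{1/2})
    M = x.shape[0]
    diff = x - mu
    norm = (2.0 * np.pi) ** (M / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm
```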

(a) [6 points] Suppose we have already fit ϕ, Σ, µ0, and µ1, and now want to make a prediction at some new query point x. Show that the posterior distribution of the label $y$ at $x$ takes the form of a logistic function, and can be written as

$$p(y = 1 \mid x; \phi, \Sigma, \mu_0, \mu_1) = \frac{1}{1 + \exp(-w^\top \hat{x})}$$

where $\hat{x}$ is an $(M+1)$-dimensional vector ($\hat{x} \in \mathbb{R}^{M+1}$) obtained from the original $x$ by adding an extra coordinate $\hat{x}_0 = 1$, and $w$ is a function of ϕ, Σ, µ0, and µ1.

(b) [14 points] The maximum likelihood estimates of ϕ, µ0, and µ1 are given by

$$\phi = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\{y^{(i)} = 1\}$$

$$\mu_0 = \frac{\sum_{i=1}^{N} \mathbb{I}\{y^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^{N} \mathbb{I}\{y^{(i)} = 0\}}$$

$$\mu_1 = \frac{\sum_{i=1}^{N} \mathbb{I}\{y^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^{N} \mathbb{I}\{y^{(i)} = 1\}}$$

where $\mathbb{I}\{\cdot\}$ is the indicator function. The log-likelihood of the data is

$$\ell(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{N} p(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma)$$

$$= \log \prod_{i=1}^{N} p(x^{(i)} \mid y^{(i)}; \phi, \mu_0, \mu_1, \Sigma)\, p(y^{(i)}; \phi)$$

By maximizing ℓ with respect to the three parameters (ϕ, µ0, and µ1), prove that the maximum

likelihood estimates of ϕ, µ0, and µ1 are indeed the ones described above. (You may assume that there

is at least one positive and one negative example, so that the denominators in the definitions of µ0 and

µ1 above are non-zero.)
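Since the closed forms for ϕ, µ0, and µ1 are stated above, they translate directly into code; a minimal sketch (X an N×M matrix of examples, y a 0/1 label vector; the names are illustrative):

```python
import numpy as np

def gda_mle(X, y):
    # phi = fraction of positive examples; mu_k = mean of examples with label k
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    return phi, mu0, mu1
```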


(c) [5 points] When $M$ (the dimension of $x^{(i)}$) is 1, $\Sigma = \sigma^2$ becomes a scalar, and its determinant is $|\Sigma| = \sigma^2$. By maximizing ℓ in (b) with respect to Σ, prove that the maximum likelihood estimate for Σ becomes

$$\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^\top.$$


## 4 [23 points] Naive Bayes for Classifying SPAM

(a) [6 points] Naive Bayes with Bayesian Smoothing

Recall that Naive Bayes can be solved with MLE, in which we count the occurrences of each feature (or word). Adding Laplace smoothing, we get:

$$P(C_i) = \phi_i = \frac{N^{C_i}}{\sum_{i'} N^{C_{i'}}} \tag{1}$$

$$P(x_j \mid C_i) = \mu^i_j = \frac{N^{C_i}_j + \alpha}{\sum_{j'} N^{C_i}_{j'} + \alpha K} \tag{2}$$

where $K$ is the number of classes and $M$ is the dimension of $x$ (the total number of features, or words). $N^{C_i}_j$ is the count of the occurrences of $x_j$ with class $C_i$. $\alpha > 0$ is the Laplace smoothing hyperparameter.

Show that Laplace smoothing is equivalent to solving the MAP estimate of Naive Bayes, where we have a prior on the values of µ which follow a symmetric Dirichlet distribution:

$$P(\mu) = \frac{1}{Z} \prod_{i=1}^{K} \prod_{j=1}^{M} \left(\mu^i_j\right)^{\alpha} \tag{3}$$

where $Z$ is some normalizing constant.

[Hint: You may use the Naive Bayes likelihood and MLE derivations from lecture without proof.]

(b) In this subproblem, we will use the naive Bayes algorithm, which we covered in class, to build a SPAM classifier. Our classifier will distinguish between "real" (non-spam) emails and spam emails. For this problem, we obtained a set of spam emails and a set of non-spam newsgroup messages. Using only the subject line and body of each message, the classifier will learn to distinguish between the two groups. In this problem, we're going to call the features tokens.

i. [9 points] (Autograder) Implement a naive Bayes classifier for spam classification, using the multinomial event model and Laplace smoothing. naive_bayes_spam.ipynb will walk you through implementing this classifier. In this task, we will train your parameters with the training dataset MATRIX.TRAIN. Then, use these parameters to classify the test dataset MATRIX.TEST and compute the accuracy using the evaluate function. Implement your code in naive_bayes_spam.py and submit your naive_bayes_spam.py implementation with naive_bayes_spam.ipynb. Please include all the outputs in your naive_bayes_spam.ipynb file.
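As a rough illustration of the multinomial event model with Laplace smoothing, one plausible sketch is below. This is a hypothetical illustration, not the required naive_bayes_spam.py interface; the token-count matrix layout and all names are assumptions:

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """X: (N, M) token-count matrix, y: (N,) 0/1 labels, alpha: smoothing.

    Returns log class priors and smoothed log P(token | class) per class.
    """
    log_prior0 = np.log(np.mean(y == 0))
    log_prior1 = np.log(np.mean(y == 1))
    M = X.shape[1]
    counts1 = X[y == 1].sum(axis=0)   # token counts over spam emails
    counts0 = X[y == 0].sum(axis=0)   # token counts over non-spam emails
    log_mu1 = np.log((counts1 + alpha) / (counts1.sum() + alpha * M))
    log_mu0 = np.log((counts0 + alpha) / (counts0.sum() + alpha * M))
    return log_prior0, log_prior1, log_mu0, log_mu1

def predict_nb(X, params):
    # Compare per-class log-posteriors (up to the shared evidence term)
    log_prior0, log_prior1, log_mu0, log_mu1 = params
    s1 = log_prior1 + X @ log_mu1
    s0 = log_prior0 + X @ log_mu0
    return (s1 > s0).astype(int)
```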

ii. [2 points] Some tokens may be particularly indicative of an email being spam or non-spam. One way to measure how indicative a token $i$ is for the SPAM class is by looking at:

$$\log \frac{p(x_j = i \mid y = 1)}{p(x_j = i \mid y = 0)} = \log \frac{P(\mathrm{token}_i \mid \text{email is SPAM})}{P(\mathrm{token}_i \mid \text{email is NOTSPAM})}$$

Using the parameters you obtained in part i, find the 5 tokens that are most indicative of the SPAM class (i.e., the 5 tokens that have the highest positive values on the measure above). What are the first five tokens from your implementation? Please provide them in your writeup.
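Given per-class log token probabilities, ranking by the measure above is a short computation; a sketch with illustrative names (log_mu1, log_mu0, and vocab are assumptions, not the assignment's variables):

```python
import numpy as np

def top_spam_tokens(log_mu1, log_mu0, vocab, n=5):
    # log P(token|SPAM) - log P(token|NOTSPAM), largest first
    ratio = log_mu1 - log_mu0
    top = np.argsort(ratio)[::-1][:n]
    return [vocab[i] for i in top]
```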

iii. [2 points] Repeat part i, but with different naive Bayes classifiers, each trained with a different training set MATRIX.TRAIN.*. Evaluate the classifiers with MATRIX.TEST and report the classification accuracy for each classifier in your writeup.

iv. [2 points] Please provide the final training data size-accuracy plot from part iii in your writeup.

v. [2 points] Which training set size gives you the best classification accuracy? Please provide your

own analysis in your writeup.