## Description

## 1 Back Propagation Through Time

Back propagation through time (BPTT) is a gradient-based technique for training certain types of

recurrent neural networks (RNN). Unlike feedforward neural networks, RNNs can use their internal

memory to process arbitrary sequences of inputs. BPTT begins by unfolding a recurrent neural

network through time as shown in the figure below.

Here, xt are the input sequence to the RNN and st are the hidden states of the RNN. Our goal is

to predict the output sequence yt

. For simplicity, we assume that dimension of input xt

, output

yt and hidden state st are the same as M. Moreover, we ignore all the biases in our formulation.

There are three connection weight matrices WIH, WHH and WHO for the input-hidden, hiddenhidden and hidden-output connections. The behavior of the RNN can be described as the following

dynamical system with non-linearity:

st = σ(WIHxt + WHHst−1), yt = WHOst

,

where σ(x) = 1

1+e−x is the sigmoid function.

BPTT algorithm is commonly used to train the RNN to obtain the weight matrices WIH, WHH

and WHO from training sequences {(xˆ1,…,T , yˆ1,…,T )} of length T. In this problem, we use the

squared loss as the loss function. Let y1,…,T be the prediction of your RNN, we have loss function

L(y1,…,T , yˆ1,…,T ) = 1

2

X

T

i=1

||yi − yˆi

||2

2

.

(a) As a first step, you need to compute the gradient of loss function L with respect to yt

, ∇ytL.

For this problem, you can assume that there is only one training sequence (xˆ1,…,T , yˆ1,…,T ).

(b) In this step, you need to compute the gradient of loss function L with respect to st

, ∇stL.

You should first compute ∇sT L and then express ∇stL in terms of ∇st+1L. It is where the

name BPTT comes from.

(c) Use your answer in last two steps to derive the gradient of loss function L with respect to

WIH, WHH and WHO.

(d) When the length of sequence T is large, gradient descend method usually leads to bad performance due to vanishing gradients. As a cure for the problem, leaky hidden units can be

1

used to learn long range dependency. Namely, we use the following equation to update the

hidden states:

st = (1 − τ ) · st−1 + τ · σ(WIHxt + WHHst−1),

where τ is a constant. You need to compute the gradient of loss function L with respect to

WIH, WHH and WHO when the leaky hidden units are used.

2 Kernel K-Means

In this problem, you will apply the kernel tricks to K-means algorithm to make it more powerful.

Recall that the K-means algorithm optimizes the following objective: Given a set of data points

{xn}

N

n=1, the method minimizes the following distortion measure (or objective or clustering cost):

D =

X

N

n=1

X

K

k=1

rnkkxn − µkk

2

2

where µk is the prototype of the k-th cluster. rnk is a binary indicator variable. If xn is assigned

to the cluster k, rnk is 1 otherwise rnk is 0. For each cluster, µk is the representative for all the

data points assigned to that cluster.

Now assume that we apply a mapping φ(x) to map data points into feature space. Then, we

define the objective function of kernel K-means as

D˜ =

X

N

n=1

X

K

k=1

rnkkφ(xn) − µ˜kk

2

2

,

where µ˜k is the center of the cluster k in the feature space.

(a) Show that the D˜ can be represented in terms of only kernel K(xi

, xj ) = φ(xi)

Tφ(xj).

(b) Describe and write down the equation of assigning a data point to its cluster. You answer

should only consist of kernel value K(xi

, xj ).

(c) Write down the pseudo-code of the complete kernel K-means algorithm including initialization

of cluster centers.

3 EM algorithm

Zero-inflated Poisson regression is used to model count data that has an excess of zero counts. For

example, the number of insurance claims within a population for a certain type of risk would be

zero-inflated by those people who have not taken out insurance against the risk and thus are unable

to claim. The observed data probability of observation i is:

p(xi) = (

π + (1 − π)e

−λ

if xi = 0

(1 − π)

λ

xi e−λ

xi!

if xi > 0.

Your task in this problem is to design an EM to estimate the parameter π and λ from observed

data {xi}

N

i=1.

2

(a) Define a proper hidden variable zi for the observations (Hint: you only need hidden variables

for some observations) and use them to write down the complete likelihood function.

(b) Write down the update equations for both the E-Step and the M-step.

3

4 Programming

In this problem, you will implement three different clustering methods, K-means, Kernel K-means

and Gaussian Mixture Model. You will evaluate the performance of your method on two synthetic

datasets.

4.1 Data

You are provided with two datasets, hw5 blob.mat and hw5 circle.mat. Both datasets have two

dimensions.

4.2 Implement k-means

As we studied in the class, k-means tries to minimize the following distortion measure (or objective

function):

J =

X

N

n=1

X

K

k=1

rnk||xn − µk||2

2

where rnk is an indicator variable:

rnk = 1 if and only if xn belongs to cluster k

and (µ1, . . . , µk) are the cluster centers with the same dimension of data points.

(a) Implement k-means using random initialization for cluster centers. The algorithm should run

until none of the cluster assignments are changed. Run the algorithm for different values of

K ∈ {2, 3, 5}, and plot the clustering assignments by different colors and markers. (you need

to report 6 plots, 3 for each dataset.)

(b) The k-means algorithm fails to separate the two circles in the hw5 circle.mat dataset. Please

explain why this happens.

4.3 Implement kernel k-means

Implement kernel k-means you derived in Problem 2 and evaluate it on the hw5 circle.mat dataset.

You should choose a kernel that can separate the two circles.

(a) Write down the choice of your kernel.

(b) Implement kernel k-means using random initialization for cluster centers (randomly pick

data points as centers of the clusters). The algorithm should run until none of the cluster

assignments are changed. Run the algorithm for K = 2, and plot the clustering assignments

by different colors and markers. (you need to report 1 plot)

4.4 Implement Gaussian Mixture Model

In this problem, you need to implement the EM algorithm to fit a Gaussian Mixture model on the

hw5 blob.mat dataset.

4

(a) Run 5 times of your EM algorithm with number of components K = 3, and plot the log

likelihood of the data over iterations of EM for each of runs. (The x-axis is the number of

iterations, and the y-axis is the log likelihood of the data given current model parameters.

Please plot all five curves in the same figure)

(b) For the best run in terms of log likelihood, (1) Plot the most likely cluster assignment of each

data point indicated by different colors and markers. (2) Report the mean and co-variance

matrix of all the three Gaussian components.

5 Submission Instructions

Submission Instructions: You need to provide the followings:

• Provide your answers to problems 1-4 in PDF file, named as CSCI567 hw6 fall15.pdf.

You need to submit the homework in both hard copy (at Locker #19 at PHE, with a

box labeled as CSCI567-homework by 6pm of the deadline date) and electronic version

as pdf file on Blackboard. If you choose handwriting instead of typing all the answers,

you will get 40% points deducted.

• Submit ALL the code and report via Blackboard by 6 pm of the deadline date. The only

acceptable language is MATLAB. For your program, you MUST include the main function called CSCI567 hw5 fall15.m in the root of your folder. After running this main

file, your program should be able to generate all of the results needed for this programming assignment, either as plots or console outputs.You can have multiple files (i.e your

sub-functions), however, the requirement is that once we unzip your folder and execute

your main file, your program should execute correctly. Please double-check your program

before submitting. You should only submit one .zip file. No other formats are allowed

except .zip file. Also, please name it as [lastname] [firstname] hw5 fall15.zip.

Collaboration: You may collaborate. However, collaboration has to be limited to discussion only

and you need to write your own solution and submit separately. You also need to list with

whom you have discussed.

5