1 Back Propagation Through Time
Back propagation through time (BPTT) is a gradient-based technique for training certain types of recurrent neural networks (RNNs). Unlike feedforward neural networks, RNNs can use their internal memory to process arbitrary sequences of inputs. BPTT begins by unfolding a recurrent neural network through time, as shown in the figure below.
[Figure: the RNN unfolded through time, with inputs $x_t$, hidden states $s_t$, and outputs $y_t$.]
Here, $x_t$ denotes the input sequence to the RNN and $s_t$ the hidden states of the RNN. Our goal is to predict the output sequence $y_t$. For simplicity, we assume that the dimensions of the input $x_t$, the output $y_t$, and the hidden state $s_t$ are all equal to $M$. Moreover, we ignore all biases in our formulation.
There are three connection weight matrices, $W_{IH}$, $W_{HH}$, and $W_{HO}$, for the input-hidden, hidden-hidden, and hidden-output connections, respectively. The behavior of the RNN can be described as the following dynamical system with a non-linearity:
$$s_t = \sigma(W_{IH} x_t + W_{HH} s_{t-1}), \qquad y_t = W_{HO} s_t,$$
where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function.
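For concreteness, the unfolded forward pass can be written as a short MATLAB loop. The sketch below is illustrative only (the function name, the zero initial state $s_0 = 0$, and the variable names are assumptions, not part of the assignment):

% Forward pass of the unfolded RNN: a minimal sketch.
% X is M-by-T (one column per time step); all weight matrices are M-by-M.
function [S, Y] = rnn_forward(X, W_IH, W_HH, W_HO)
    [M, T] = size(X);
    S = zeros(M, T);                   % hidden states s_1, ..., s_T
    Y = zeros(M, T);                   % outputs y_1, ..., y_T
    s_prev = zeros(M, 1);              % assumes s_0 = 0
    for t = 1:T
        a = W_IH * X(:, t) + W_HH * s_prev;
        S(:, t) = 1 ./ (1 + exp(-a));  % element-wise sigmoid
        Y(:, t) = W_HO * S(:, t);
        s_prev = S(:, t);
    end
end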
The BPTT algorithm is commonly used to train the RNN, i.e., to learn the weight matrices $W_{IH}$, $W_{HH}$, and $W_{HO}$ from training sequences $\{(\hat{x}_{1,\dots,T}, \hat{y}_{1,\dots,T})\}$ of length $T$. In this problem, we use the squared loss as the loss function. Letting $y_{1,\dots,T}$ be the predictions of your RNN, the loss function is
$$L(y_{1,\dots,T}, \hat{y}_{1,\dots,T}) = \frac{1}{2} \sum_{i=1}^{T} \|y_i - \hat{y}_i\|_2^2.$$
(a) As a first step, you need to compute the gradient of the loss function $L$ with respect to $y_t$, i.e., $\nabla_{y_t} L$. For this problem, you can assume that there is only one training sequence $(\hat{x}_{1,\dots,T}, \hat{y}_{1,\dots,T})$.
(b) In this step, you need to compute the gradient of the loss function $L$ with respect to $s_t$, i.e., $\nabla_{s_t} L$. You should first compute $\nabla_{s_T} L$ and then express $\nabla_{s_t} L$ in terms of $\nabla_{s_{t+1}} L$; this backward recursion through time is where the name BPTT comes from.
(c) Use your answers from the last two steps to derive the gradient of the loss function $L$ with respect to $W_{IH}$, $W_{HH}$, and $W_{HO}$.
(d) When the sequence length $T$ is large, gradient descent usually performs poorly due to vanishing gradients. As a cure for this problem, leaky hidden units can be used to learn long-range dependencies. Namely, we use the following equation to update the hidden states:
$$s_t = (1 - \tau) \cdot s_{t-1} + \tau \cdot \sigma(W_{IH} x_t + W_{HH} s_{t-1}),$$
where $\tau$ is a constant. You need to compute the gradient of the loss function $L$ with respect to $W_{IH}$, $W_{HH}$, and $W_{HO}$ when the leaky hidden units are used. (A numerical check for debugging such gradient derivations is sketched below.)
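Whatever form your analytic gradients take, a centered finite-difference check is a standard way to verify them numerically. The following minimal MATLAB sketch compares an analytic gradient against a numerical estimate (the function name and interface are illustrative assumptions):

% Finite-difference gradient check: compares an analytic gradient gradW
% against a numerical estimate. lossFn maps a weight matrix W to the scalar loss L.
function err = grad_check(lossFn, W, gradW)
    h = 1e-5;
    numGrad = zeros(size(W));
    for i = 1:numel(W)
        Wp = W;  Wp(i) = Wp(i) + h;    % perturb one entry up
        Wm = W;  Wm(i) = Wm(i) - h;    % and down
        numGrad(i) = (lossFn(Wp) - lossFn(Wm)) / (2 * h);
    end
    err = norm(numGrad(:) - gradW(:)) / max(1, norm(gradW(:)));  % relative error
end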
2 Kernel K-Means
In this problem, you will apply the kernel trick to the K-means algorithm to make it more powerful. Recall that the K-means algorithm optimizes the following objective: given a set of data points $\{x_n\}_{n=1}^{N}$, the method minimizes the following distortion measure (also called the objective or clustering cost):
$$D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|_2^2,$$
where $\mu_k$ is the prototype of the $k$-th cluster and $r_{nk}$ is a binary indicator variable: $r_{nk} = 1$ if $x_n$ is assigned to cluster $k$, and $r_{nk} = 0$ otherwise. For each cluster, $\mu_k$ is the representative of all the data points assigned to that cluster.
Now assume that we apply a mapping $\phi(x)$ to map the data points into a feature space. Then, we define the objective function of kernel K-means as
$$\tilde{D} = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|\phi(x_n) - \tilde{\mu}_k\|_2^2,$$
where $\tilde{\mu}_k$ is the center of cluster $k$ in the feature space.
(a) Show that $\tilde{D}$ can be represented in terms of only the kernel $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$; a starting expansion is sketched after part (c).
(b) Describe and write down the equation for assigning a data point to its cluster. Your answer should only involve kernel values $K(x_i, x_j)$.
(c) Write down the pseudo-code of the complete kernel K-means algorithm, including the initialization of cluster centers.
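A useful starting point for part (a) is the standard expansion of a squared distance in feature space. Writing $N_k = \sum_m r_{mk}$ for the size of cluster $k$, so that $\tilde{\mu}_k = \frac{1}{N_k} \sum_m r_{mk}\,\phi(x_m)$, expanding the square gives
$$\|\phi(x_n) - \tilde{\mu}_k\|_2^2 = K(x_n, x_n) - \frac{2}{N_k} \sum_{m=1}^{N} r_{mk}\, K(x_n, x_m) + \frac{1}{N_k^2} \sum_{m=1}^{N} \sum_{m'=1}^{N} r_{mk}\, r_{m'k}\, K(x_m, x_{m'}),$$
in which $\phi$ appears only through kernel evaluations.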
3 EM Algorithm
Zero-inflated Poisson regression is used to model count data that has an excess of zero counts. For
example, the number of insurance claims within a population for a certain type of risk would be
zero-inflated by those people who have not taken out insurance against the risk and thus are unable
to claim. The observed-data probability of observation $i$ is:
$$p(x_i) = \begin{cases} \pi + (1 - \pi) e^{-\lambda} & \text{if } x_i = 0, \\ (1 - \pi) \dfrac{\lambda^{x_i} e^{-\lambda}}{x_i!} & \text{if } x_i > 0. \end{cases}$$
Your task in this problem is to design an EM algorithm to estimate the parameters $\pi$ and $\lambda$ from the observed data $\{x_i\}_{i=1}^{N}$; a reference sketch of the resulting loop follows the two parts below.
(a) Define a proper hidden variable $z_i$ for the observations (Hint: you only need hidden variables for some observations) and use them to write down the complete likelihood function.
(b) Write down the update equations for both the E-step and the M-step.
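For reference, once the hidden variables are defined, the EM loop has a closed form. The MATLAB sketch below assumes the common choice where $z_i = 1$ indicates that observation $i$ comes from the always-zero component; the function and variable names are illustrative:

% EM for the zero-inflated Poisson model: a minimal sketch.
% x is an N-by-1 vector of counts; p0, lam0 are initial guesses for pi and lambda.
function [p, lam] = zip_em(x, p0, lam0, maxIter)
    p = p0;  lam = lam0;
    gamma = zeros(numel(x), 1);
    for it = 1:maxIter
        % E-step: posterior probability that each zero came from the point mass at 0.
        gamma(:) = 0;
        gamma(x == 0) = p / (p + (1 - p) * exp(-lam));
        % M-step: closed-form updates of the mixing weight and the Poisson rate.
        p   = mean(gamma);
        lam = sum((1 - gamma) .* x) / sum(1 - gamma);
    end
end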
4 Programming
In this problem, you will implement three different clustering methods: K-means, kernel K-means, and the Gaussian mixture model. You will evaluate the performance of your methods on two synthetic datasets.
4.1 Data
You are provided with two datasets, hw5_blob.mat and hw5_circle.mat. Both datasets have two dimensions.
4.2 Implement k-means
As we studied in class, k-means tries to minimize the following distortion measure (or objective function):
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|_2^2,$$
where $r_{nk}$ is an indicator variable: $r_{nk} = 1$ if and only if $x_n$ belongs to cluster $k$, and $(\mu_1, \dots, \mu_K)$ are the cluster centers, with the same dimension as the data points.
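For concreteness, one iteration of the algorithm (the assignment step followed by the update step) might look like the following minimal MATLAB sketch; the function name is illustrative, and the sketch assumes MATLAB R2016b+ implicit expansion:

% One iteration of k-means: assign each point to its nearest center,
% then move each center to the mean of its assigned points.
% X is N-by-D data, mu is K-by-D centers; idx is the N-by-1 assignment.
function [idx, mu] = kmeans_step(X, mu)
    % Squared distances via ||x - m||^2 = ||x||^2 - 2 x'm + ||m||^2, giving an N-by-K matrix.
    D2 = sum(X.^2, 2) - 2 * (X * mu') + sum(mu.^2, 2)';
    [~, idx] = min(D2, [], 2);
    for k = 1:size(mu, 1)
        if any(idx == k)               % guard against empty clusters
            mu(k, :) = mean(X(idx == k, :), 1);
        end
    end
end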
(a) Implement k-means using random initialization for the cluster centers. The algorithm should run until none of the cluster assignments change. Run the algorithm for different values of $K \in \{2, 3, 5\}$, and plot the clustering assignments using different colors and markers. (You need to report 6 plots, 3 for each dataset.)
(b) The k-means algorithm fails to separate the two circles in the hw5_circle.mat dataset. Please explain why this happens.
4.3 Implement kernel k-means
Implement the kernel k-means algorithm you derived in Problem 2 and evaluate it on the hw5_circle.mat dataset. You should choose a kernel that can separate the two circles; a sketch of the kernelized assignment step is given after part (b).
(a) Write down your choice of kernel.
(b) Implement kernel k-means using random initialization (randomly pick data points as the initial centers of the clusters). The algorithm should run until none of the cluster assignments change. Run the algorithm for $K = 2$, and plot the clustering assignments using different colors and markers. (You need to report 1 plot.)
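As a reference for the implementation, the assignment step derived in Problem 2 uses only entries of the precomputed kernel matrix. The minimal MATLAB sketch below is illustrative (the function name, and the assumption that no cluster goes empty, are not part of the assignment):

% Kernel k-means assignment step: reassign every point to its nearest
% feature-space center, using only kernel-matrix entries.
% Kmat is the N-by-N kernel matrix; idx holds current labels in 1..K.
function idx = kernel_kmeans_assign(Kmat, idx, K)
    N = size(Kmat, 1);
    D = zeros(N, K);
    for k = 1:K
        members = (idx == k);
        nk = nnz(members);                              % assumes nk > 0
        D(:, k) = diag(Kmat) ...
                - 2 * sum(Kmat(:, members), 2) / nk ...
                + sum(sum(Kmat(members, members))) / nk^2;
    end
    [~, idx] = min(D, [], 2);                           % nearest center in feature space
end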
4.4 Implement Gaussian Mixture Model
In this problem, you need to implement the EM algorithm to fit a Gaussian mixture model on the hw5_blob.mat dataset. (A sketch of the log-likelihood computation used in part (a) is given after part (b).)
(a) Run your EM algorithm 5 times with the number of components $K = 3$, and plot the log likelihood of the data over the EM iterations for each run. (The x-axis is the number of iterations, and the y-axis is the log likelihood of the data given the current model parameters. Please plot all five curves in the same figure.)
(b) For the best run in terms of log likelihood: (1) plot the most likely cluster assignment of each data point, indicated by different colors and markers; (2) report the mean and covariance matrix of each of the three Gaussian components.
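For part (a), the quantity plotted at each iteration is the log likelihood of the data under the current parameters. Below is a minimal MATLAB sketch; it relies on the Statistics Toolbox function mvnpdf, and the function name and interface are otherwise illustrative:

% Log likelihood of the data under the current mixture parameters.
% X is N-by-D data; w is 1-by-K mixing weights; mu is K-by-D; Sigma is D-by-D-by-K.
function ll = gmm_loglik(X, w, mu, Sigma)
    P = zeros(size(X, 1), numel(w));
    for k = 1:numel(w)
        P(:, k) = w(k) * mvnpdf(X, mu(k, :), Sigma(:, :, k));  % weighted component density
    end
    ll = sum(log(sum(P, 2)));  % log of the mixture density, summed over data points
end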
5 Submission Instructions
You need to provide the following:
• Provide your answers to Problems 1-4 in a PDF file named CSCI567_hw5_fall15.pdf. You need to submit the homework both in hard copy (at Locker #19 at PHE, in the box labeled CSCI567-homework, by 6pm on the deadline date) and electronically as a PDF file on Blackboard. If you choose handwriting instead of typing all the answers, you will get 40% of the points deducted.
• Submit ALL the code and the report via Blackboard by 6pm on the deadline date. The only acceptable language is MATLAB. Your program MUST include a main function called CSCI567_hw5_fall15.m in the root of your folder. After running this main file, your program should be able to generate all of the results needed for this programming assignment, either as plots or as console outputs. You can have multiple files (i.e., your sub-functions); however, the requirement is that once we unzip your folder and execute your main file, your program executes correctly. Please double-check your program before submitting. You should submit only one .zip file; no other formats are allowed. Also, please name it [lastname]_[firstname]_hw5_fall15.zip.
Collaboration: You may collaborate. However, collaboration is limited to discussion only, and you need to write your own solutions and submit them separately. You also need to list with whom you have discussed.