COMPSCI 527 Homework 3 — Duke — October 20, 2014

1. The K-means algorithm is described in Section 13.4.4 of the textbook. It takes I data points Z = {z1, …, zI} in R^d and K points m1, …, mK in R^d, and computes a local minimum Ŝ of the function

    f(S) = Σ_{k=1}^{K} min_{µk ∈ R^d} Σ_{z ∈ Sk} ‖z − µk‖²

where S = {S1, …, SK} varies over the set S of all partitions of Z. The partition Ŝ is a local minimum in the sense that no move of a single point zi from one set of Ŝ to another yields a value of f that is lower than f(Ŝ), nor does any infinitesimal change of the resulting centroids µ1, …, µK. See algorithm 1 for a pseudo-code listing of the K-means algorithm.
Algorithm 1. The K-means algorithm
Input: z1, …, zI, m1, …, mK
for k = 1, …, K do
    µk ← mk    ▷ Initialize the means
end for
fc ← ∞    ▷ fc will become the current value of f(S)
repeat
    for k = 1, …, K do
        Sk ← {z | z ∈ Z and ‖z − µk‖² ≤ ‖z − µℓ‖² for ℓ ≠ k}    ▷ Partition the points in Z by their closest mean
    end for
    for k = 1, …, K do
        if Sk ≠ ∅ then
            µk ← (1/|Sk|) Σ_{z ∈ Sk} z    ▷ µk is the centroid of Sk
        end if
    end for
    fp ← fc    ▷ Remember the previous value of f(S)
    fc ← f({S1, …, SK})    ▷ Compute the current value of f(S)
until fp ≤ fc + τ    ▷ Stop if improvement goes below threshold τ
Output: µ1, …, µK, S1, …, SK
(a) Implement from scratch a MATLAB function

function [M, R] = kmeans(Z, M)

that takes a d × I matrix Z whose columns are the points in Z and a d × K matrix M whose columns are the initial means mk, and runs the K-means algorithm. The output d × K matrix M has the final means µk as its columns, and the K × I logical matrix R has entries

    R(k, i) = true if zi ∈ Sk, and false otherwise.

Use the termination threshold (see the until clause in algorithm 1)

    τ = sqrt(eps)

where eps is a predefined constant in MATLAB. [Warning: If you compare this matrix with what the textbook calls the responsibilities in chapter 7, then your matrix is the transpose of that: R(k, i) = rik.]
Hand in your implementation of kmeans and a plot of the result of running
load data
[M, R] = kmeans(blobs, M0);
showClusters(blobs, M, R, M0, 1, 'blobsKmeans');
The function showClusters is provided with this assignment and will produce a PDF file named blobsKmeans.pdf. Also
turn in the values in M.
(b) Run your code on the cigars data in data.mat using M0(:, 1:2) to initialize. Show the output matrix M and your results
with showClusters.
(c) Run your code on the cigars data in data.mat again, this time using M1 (also provided) to initialize. Show the output matrix
M and your results with showClusters.
(d) Explain in what way the results in your previous two answers show that kmeans does not work well with these data points. In
particular, are these shortcomings due to the fact that the K-means algorithm computes a local (as opposed to global) minimum?
If so, how? If not, what else is responsible?
(e) The K-means algorithm trivially gets into trouble when two or more of the centroids in the initial matrix M0 are equal to each other: the coinciding centroids define exactly the same partition, and they never split apart. The resulting behavior is exactly the same as if the duplicate means were removed, and the algorithm effectively returns a result for a lower value of K. However, this is not the only way in which poor initialization leads to bad results.
Construct a set Z of points on the plane and a partition S = {S1, S2, S3, S4} of Z such that the following properties hold:

1. There are four tight, well-separated clusters C1, C2, C3, C4 in Z, and the cluster centroids are µ*1, µ*2, µ*3, µ*4.
2. If mk is the centroid of Sk, then the set {m1, m2, m3, m4} is different from the set {µ*1, µ*2, µ*3, µ*4}.
3. If the K-means algorithm is run on Z with initial centroids M0 = [m1, m2, m3, m4], then the algorithm stops after a single iteration of the repeat loop in algorithm 1 with

    µ1 = m1, µ2 = m2, µ3 = m3, µ4 = m4

(in words, the algorithm makes no progress).

Describe your construction clearly and succinctly (a drawing may help), and give the coordinates of the points in S1, S2, S3, S4, of their centroids m1, m2, m3, m4, and of the centroids µ*1, µ*2, µ*3, µ*4 of the clusters C1, C2, C3, C4. If you prefer, give formulas for all these coordinates. The clusters need not be large, but each of them must have more than two points. Explain why the algorithm makes no progress with your input. [Hints: It is possible to construct an example with three clusters, but with four it is easier to draw. Symmetry helps. There are many possible answers. You may want to use your kmeans function to verify that your construction is correct. However, please do not hand in the result of this check.]
(f) In what way does the construction in the previous question highlight a problem with the K-means algorithm?
2. The EM density estimation algorithm for mixtures of normal distributions is described in Section 7.4.2 of the textbook. It takes I data points Z = {z1, …, zI} in R^d; K points m1, …, mK in R^d; and K positive real scalars σ²1, …, σ²K, and computes a local maximum θ̂ of the likelihood function

    F(θ) = Π_{i=1}^{I} p(zi | θ)   where   p(zi | θ) = Σ_{k=1}^{K} λk Norm_{zi}[µk, Σk]

is a mixture of normals and where θ = [λ1, …, λK, µ1, …, µK, Σ1, …, ΣK]^T is the parameter vector. The vector θ̂ is a local maximum in the sense that no infinitesimal change of θ away from θ̂ produces an increase of F(θ).
There is a numerical twist to this algorithm. The likelihood F(θ) is a product of positive terms: all the normals are positive, and not all λk can be zero (because they add up to one). However, the values of the normals can be close to zero if zi is far from µk for all k where λk ≠ 0. In that case, p(zi | θ) ≈ 0, and if there are enough such small values, their product can cause numerical underflow. That is, the exact value of F(θ) is never zero, but it may be too small to be stored in a double variable, and the EM algorithm will make no progress as a consequence.
To prevent this problem, we instead minimize the negative logarithm of F(θ),

    f(θ) = −log F(θ) = −Σ_{i=1}^{I} log Σ_{k=1}^{K} λk Norm_{zi}[µk, Σk] .

The minus sign is not important. It is introduced to eliminate the minus signs that arise when taking the logarithm of a Gaussian distribution. In addition, the minus sign makes EM a minimization problem, just like K-means.
The important part from a numerical standpoint is the logarithm in the definition of f(θ). The number 10^−500, say, is too small to be stored in a double variable, but its (natural) logarithm −500 log 10 is about −1151, and that is not too small. In industrial-quality code, one would have to make sure that EM still works even when f(θ) itself is too small, but we will not worry about this in this assignment. You may assume that a single value of p(zi | θ) can be represented in a double variable, even if their product for i = 1, …, I cannot.
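
For a quick numerical illustration of this point (the values below are chosen for illustration only): ten factors of 10^−50 each fit comfortably in a double, but their product 10^−500 does not.

p = 1e-50 * ones(1, 10);   % each factor is comfortably representable
prod(p)                    % 0: the exact product 1e-500 underflows
sum(log(p))                % -1.1513e+03: the same quantity in log form is fine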
See algorithm 2 for a pseudo-code listing of the EM algorithm, cast as a local minimization algorithm.
Algorithm 2. The EM algorithm
Input: z1, …, zI, m1, …, mK, σ²1, …, σ²K
for k = 1, …, K do
    λk ← 1/K    ▷ Initialize the mixture coefficients to an uninformative prior
    µk ← mk    ▷ Initialize the means
    Σk ← σ²k Id    ▷ Initialize the covariances. Id is the d × d identity
end for
fc ← ∞    ▷ fc will become the current value of f(θ)
repeat
    for k = 1, …, K do
        for i = 1, …, I do
            rki ← λk Norm_{zi}[µk, Σk] / Σ_{j=1}^{K} λj Norm_{zi}[µj, Σj]    ▷ Compute the responsibilities. This is the E step
        end for
    end for
    for k = 1, …, K do    ▷ The three assignments in this loop are the M step
        λk ← (1/I) Σ_{i=1}^{I} rki    ▷ λk is the coefficient of the k-th mixture component
        µk ← Σ_{i=1}^{I} rki zi / Σ_{i=1}^{I} rki    ▷ µk is the sample mean of the k-th mixture component
        Σk ← Σ_{i=1}^{I} rki (zi − µk)(zi − µk)^T / Σ_{i=1}^{I} rki    ▷ Σk is the sample covariance of the k-th mixture component
    end for
    fp ← fc    ▷ Remember the previous value of f(θ)
    fc ← f(λ1, …, λK, µ1, …, µK, Σ1, …, ΣK)    ▷ Compute the current value of f(θ)
until fp ≤ fc + τ    ▷ Stop if improvement falls below τ
Output: λ1, …, λK, µ1, …, µK, Σ1, …, ΣK, r11, …, rKI
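
Again for concreteness, here is a minimal MATLAB sketch of algorithm 2, written against the interface required in part (a) below. It assumes the Statistics Toolbox function mvnpdf for the normal densities; the loop structure and the recomputation of the densities after the M step are one possible reading of the pseudo-code, not the official solution.

function [lambda, M, Sigma, R] = EM(Z, M, sigma2)
% Sketch of algorithm 2 (assumes mvnpdf from the Statistics Toolbox).
[d, I] = size(Z);
K = size(M, 2);
lambda = ones(K, 1) / K;                    % uninformative prior
Sigma = zeros(d, d, K);
for k = 1:K
    Sigma(:, :, k) = sigma2(k) * eye(d);
end
tau = sqrt(eps);
fc = Inf;
P = zeros(K, I);                            % P(k, i) = lambda_k Norm_zi[mu_k, Sigma_k]
while true
    for k = 1:K                             % E step
        P(k, :) = lambda(k) * mvnpdf(Z', M(:, k)', Sigma(:, :, k))';
    end
    R = bsxfun(@rdivide, P, sum(P, 1));     % responsibilities r_ki
    Nk = sum(R, 2);                         % M step
    lambda = Nk / I;
    for k = 1:K
        M(:, k) = Z * R(k, :)' / Nk(k);
        D = bsxfun(@minus, Z, M(:, k));
        Sigma(:, :, k) = bsxfun(@times, D, R(k, :)) * D' / Nk(k);
    end
    for k = 1:K                             % densities at the updated parameters
        P(k, :) = lambda(k) * mvnpdf(Z', M(:, k)', Sigma(:, :, k))';
    end
    fp = fc;
    fc = -sum(log(sum(P, 1)));              % f(theta), the negative log-likelihood
    if fp <= fc + tau, break, end
end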
(a) Implement from scratch a MATLAB function

[lambda, M, Sigma, R] = EM(Z, M, sigma2)

that takes the same arguments Z and M as kmeans, in addition to a vector sigma2 of K positive scalars to initialize EM. Your function EM uses the variances σ²k in sigma2 to generate the initial values of the d × d mixture covariance matrices as

    Σk = sigma2(k) * eye(d)

where eye(d) generates the d × d identity matrix and d = size(Z, 1). The output lambda of EM is a K-dimensional vector with the mixture coefficients λk. The output M is a d × K matrix that collects the estimated centroids of the mixture components. The output Sigma is a d × d × K array such that

    Sigma(:, :, k) = Σk ,

the full k-th covariance matrix estimated by the EM algorithm. The K × I output matrix R contains the responsibilities computed by EM. Use the same termination threshold as in kmeans.
[Warning: If you compare this matrix with the responsibilities the textbook defines in chapter 7, then your matrix is the transpose
of that: R(k, i) = rik.]
Hand in your implementation of EM and a plot of the result of running
load data
[lambda, M, Sigma, R] = EM(blobs, M0, ones(1, 3));
showMixture(blobs, lambda, M, Sigma, R, M0, 1, 'blobsEM')

The function showMixture is provided with this assignment and will produce a PDF file named blobsEM.pdf. The resulting plot is in color, so you can see more clearly what happens on screen, but it is OK to turn in a black-and-white print. Also give the resulting values of lambda, M, and Sigma.
(b) Run your function EM with first argument cigars, initializing with M0(:, 1:2).
(c) Is this result more satisfactory than using K-means on the same data? Explain briefly.
(d) With either K-means or EM you need to know the number of clusters ahead of time. Show plots (using showMixture) of
running your EM function on the blobs data with K = 2 and then with K = 4. Use M0(:, 1:2) and M04 (provided) to initialize.
(e) Run your function EM with first argument bananas (provided), initializing with M0(:, 1:2). Turn in M and the plot generated
with showMixture.
(f) In what ways is the result in your answer to the previous question unsatisfactory? Is this a limitation of the EM algorithm or of
modeling data with a mixture of normal distributions? Explain briefly.
3. The issues with EM (and therefore with K-means) highlighted by some of your answers in the previous problem motivate us to find
a clustering method (that is, an algorithm that groups data into tight groups for some definition of “tight”) that can detect non-convex
clusters and does not require knowing the number of clusters ahead of time. One such method is based on the mean shift mode-seeking
algorithm introduced by K. Fukunaga and L. D. Hostetler (IEEE Transactions on Information Theory, 21(1):32-40, 1975). Mean shift is
not a clustering algorithm in itself, but can be used to make one. So please bear with me.
Let Kh(z) be a Gaussian kernel defined as follows:

    Kh(z) = e^{−(‖z‖/h)²}

where h > 0 is called the bandwidth of the kernel (Kh(z) is not a probability density, as it is not normalized to integrate to one). Given a set Z = {z1, …, zI} of data points in R^d, the quantity

    φh(z) = Σ_{i=1}^{I} Kh(z − zi)

is a measure of the local density of the data in a neighborhood of z: if there are many data points zi near z, then the value of φh(z) is large. The vector

    µh(z) = Σ_{i=1}^{I} zi Kh(z − zi) / Σ_{i=1}^{I} Kh(z − zi)

is an average of the data zi weighted by a decreasing function of their distance from z, and can therefore be interpreted as the local centroid (or mean) of the data around z. If the data is locally Gaussian, the density at the centroid µh(z) is no less than the density at z. Figure 1 illustrates this point.
Thus, we have found (i) a way to measure the data density around any point z ∈ R^d and (ii) a new point z′ = µh(z) where the density is equal to or greater than that at z. These observations yield an astonishingly simple algorithm that seeks a mode of the density of the data in Z: start anywhere (at some initial point zstart) and keep shifting from z to the local mean µh(z) (which becomes the new z), until z and µh(z) coincide. Algorithm 3 summarizes this procedure. Of course, this algorithm is local, and which mode it finds depends on zstart. In addition, what "local" means depends on the value of the bandwidth parameter h.
Figure 1. The large circle suggests a Gaussian function centered at z (small hollow circle). The distribution of points (black dots) under this Gaussian is lopsided, and the weighted mean z′ (small hollow square) of the data does not coincide with z. If z is moved (arrow) to point z′, then the local density increases.

Algorithm 3. The mean shift algorithm
Input: z1, …, zI, zstart, h > 0
z′ ← zstart
repeat
    z ← z′
    z′ ← Σ_{i=1}^{I} zi Kh(z − zi) / Σ_{i=1}^{I} Kh(z − zi)
until ‖z − z′‖ ≤ τ    ▷ Stop if there is not enough improvement
Output: z′

(a) Implement from scratch a MATLAB function with header

function [z, zh] = meanShift(zstart, Z, h)

that computes a local mode of the data in a set Z, starting at the point zstart and with bandwidth h. The input zstart is a d × 1 vector with the starting point; the d × I input matrix Z contains the I data points in its columns; and the positive input scalar h is the bandwidth. The output z is a d × 1 vector with the local mode of the data. If the algorithm takes k steps to reach the mode, then the output zh is a d × k matrix that records the path traversed from zstart to z.

Use termination threshold

    τ = h/1000 .
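
A minimal sketch of one possible implementation follows; the vectorized kernel evaluation and the path bookkeeping are implementation choices, not prescribed by the assignment.

function [z, zh] = meanShift(zstart, Z, h)
% Sketch of algorithm 3 (one possible implementation).
tau = h / 1000;
zp = zstart;                               % z', the shifted point
zh = zstart;                               % path traversed; zstart is the first column
while true
    z = zp;
    w = exp(-sum(bsxfun(@minus, Z, z).^2, 1) / h^2);   % kernel weights K_h(z - z_i)
    zp = Z * w' / sum(w);                  % local weighted mean mu_h(z)
    zh = [zh, zp];                         %#ok<AGROW> record the step
    if norm(z - zp) <= tau, break, end     % stop when the shift is below tau
end
z = zp;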
Hand in your code and the two plots resulting from the calls
zstart = [0 1.5]';
[z1, zh1] = meanShift(zstart, blobs, 0.2);
[z2, zh2] = meanShift(zstart, blobs, 2);
showPaths({zh1, zh2}, blobs, 'paths');
The function showPaths is provided with this assignment, and will produce a PDF file named paths.pdf. It is OK to hand
in a black-and-white print. Also state how many iterations it took for each of the two paths.
(b) Explain the seemingly counterintuitive result for the larger bandwidth h = 2.
4. The mean shift algorithm can be used to construct a mean shift clustering algorithm. The recipe is again very simple: run the mean
shift algorithm I times, starting at each of the data points in turn, and record the final mode for each run. Points that are in the same
cluster (at a scale measured by the bandwidth h of the mean shift algorithm) will converge to the same mode, so we can group all the
points in Z by the resulting modes: if two points lead to the same mode, they belong to the same cluster. Algorithm 4 summarizes.
(a) Write a MATLAB function with header
function [U, R] = meanShiftCluster(Z, h)
that takes a d × I array Z of data and a positive bandwidth h and returns a d × K array U with K modes and a logical K × I array
R of cluster memberships, in the same style as kmeans. K is the number of clusters that meanShiftCluster finds.
Algorithm 4. The mean shift clustering algorithm
Input: Z = {z1, …, zI} and h > 0
for i = 1, …, I do
    M(:, i) ← meanShift(zi, Z, h)    ▷ meanShift terminates when there is less than τ motion
end for
U ← near-unique columns of M    ▷ Two columns of M are the same if their Euclidean distance is less than 10τ
K ← number of columns of U    ▷ The algorithm determines the number of clusters on its own
for k = 1, …, K do
    for i = 1, …, I do
        if M(:, i) = U(:, k) then
            R(k, i) ← 1
        else
            R(k, i) ← 0
        end if
    end for
end for
Output: Modes U and logical matrix R of cluster memberships
Use termination threshold

    τ = h/1000

for meanShift, and declare two modes to be the same if they are less than 10τ apart.
Hand in your code and a plot that uses showClusters to show the modes and the clusters found on the bananas data set with
a bandwidth h = 0.8. [You do not have the argument M0, so you can pass the empty matrix instead.] Also report the running time
of your code in seconds, and the number of clusters found.
[Hints: While the algorithm is simple in principle, its computational cost is high: for each starting point zi in Z, and at each iteration of mean shift, it is necessary to compute the distance between z and every point in Z. So with k iterations per point on average, mean shift clustering takes O(kI²) distance computations.
To reduce the computational burden, note that points zi that are very far from z (relative to h) lead to near-zero values of Kh(z − zi) and can be ignored. Ways to take advantage of this fact are beyond the scope of this course, so you may use the function rangesearch available in the MATLAB Statistics Toolbox. If you do not have that toolbox, you can download a similar function from the MathWorks web site by searching for "Yi Cao fast range search". Ignore distances that are greater than 2h.]
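
For concreteness, here is a straightforward sketch that follows algorithm 4 literally. It assumes the meanShift function from problem 3(a) and omits the rangesearch acceleration suggested in the hints, so it performs the full O(kI²) distance computations.

function [U, R] = meanShiftCluster(Z, h)
% Sketch of algorithm 4 (one possible implementation, no range-search speedup).
[d, I] = size(Z);
tau = h / 1000;
M = zeros(d, I);
for i = 1:I
    M(:, i) = meanShift(Z(:, i), Z, h);    % mode reached from data point i
end
% Merge near-identical modes into the unique cluster centers U
U = M(:, 1);
label = ones(1, I);
for i = 2:I
    dist = sqrt(sum(bsxfun(@minus, U, M(:, i)).^2, 1));  % distances to known modes
    [dmin, k] = min(dist);
    if dmin < 10 * tau                     % same mode as an existing cluster
        label(i) = k;
    else                                   % a new mode, hence a new cluster
        U = [U, M(:, i)];                  %#ok<AGROW>
        label(i) = size(U, 2);
    end
end
K = size(U, 2);
R = false(K, I);
R(sub2ind([K, I], label, 1:I)) = true;     % logical membership matrix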
(b) Look (carefully) at your result in the previous answer and comment on the extent to which mean shift clustering addresses the problems of EM. Discuss trade-offs in the selection of the bandwidth h.