CSE 5334 Data Mining Assignment 1 solved

Problem 1
(k-means, 50pts) Generate two sets of 2-D Gaussian random data, each containing 500 samples, using
the parameters below.
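$$\mu_1 = [1, 0], \quad \mu_2 = [0, 1.5], \quad \Sigma_1 = \begin{bmatrix} 0.9 & 0.4 \\ 0.4 & 0.9 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 0.9 & 0.4 \\ 0.4 & 0.9 \end{bmatrix} \tag{1}$$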
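As an illustration of the data-generation step, here is a minimal sketch in Python using only the standard library. It assumes sampling via a Cholesky factor of the covariance matrix; the function name sample_gaussian_2d, the fixed seed, and the variable names are hypothetical choices, not part of the assignment.

```python
import math
import random

def sample_gaussian_2d(mu, sigma, n, rng):
    """Draw n samples from a 2-D Gaussian N(mu, sigma) via a Cholesky factor."""
    # Cholesky factor L of the 2x2 covariance, so that L * L^T == sigma.
    l11 = math.sqrt(sigma[0][0])
    l21 = sigma[1][0] / l11
    l22 = math.sqrt(sigma[1][1] - l21 ** 2)
    samples = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # x = mu + L z has mean mu and covariance sigma.
        samples.append((mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2))
    return samples

rng = random.Random(0)  # fixed seed is an assumption, used only for reproducibility
sigma = [[0.9, 0.4], [0.4, 0.9]]  # Sigma_1 and Sigma_2 are identical in (1)
set1 = sample_gaussian_2d([1.0, 0.0], sigma, 500, rng)
set2 = sample_gaussian_2d([0.0, 1.5], sigma, 500, rng)
X = set1 + set2  # 1000 samples, each a (x, y) tuple
```

A numerical library's multivariate normal sampler would do the same job; the explicit factorization is shown only to make the role of Σ visible.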
1. (20pts) Write a function cluster = mykmeans(X, k, c) that clusters data X ∈ ℝ^{n×p} (n is the
number of objects and p the number of attributes) into k clusters. Here c holds the initial centers;
although supplying them is usually not necessary, we will need them to test your function. Terminate
the iteration when the ℓ2-norm between a previous center and an updated center is ≤ 0.001 or the
number of iterations reaches 10000.
2. (15pts) Apply your code to the data generated above with k = 2 and initial centers c1 = (10, 10) and
c2 = (−10, −10). In your report, report the centers found for each cluster. How many iterations did it
take? Show a scatter plot of the data and the centers of clusters found.
3. (15pts) Apply your code to the data generated above with k = 4 and initial centers c1 = (10, 10),
c2 = (−10, −10), c3 = (10, −10), and c4 = (−10, 10). In your report, report the centers found for each
cluster. How many iterations did it take? Show a scatter plot of the data and the centers of clusters
found.
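The interface described in item 1 could be sketched as follows. This is one possible reading, not a reference solution: it assumes Lloyd's algorithm with Euclidean distance, interprets the stopping rule as "no center moved by more than 0.001", lets an empty cluster keep its previous center, and returns the centers and iteration count alongside the labels. All of these choices, and the parameter names tol and max_iter, are assumptions.

```python
import math

def mykmeans(X, k, c, tol=1e-3, max_iter=10000):
    """Lloyd's k-means on a list of points. Returns (labels, centers, iterations)."""
    centers = [tuple(ci) for ci in c[:k]]
    labels = [0] * len(X)
    it = 0
    for it in range(1, max_iter + 1):
        # Assignment step: index of the nearest center in Euclidean distance.
        labels = [min(range(k), key=lambda j: math.dist(x, centers[j])) for x in X]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(X, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # assumption: empty cluster keeps its center
        # Largest l2 distance any center moved in this iteration.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return labels, centers, it

# Toy usage on four made-up points (not the assignment data).
toy = [(-1.0, -1.0), (-2.0, -1.0), (3.0, 3.0), (4.0, 3.0)]
labels, centers, n_iter = mykmeans(toy, 2, [(10, 10), (-10, -10)])
```

With initial centers far from the data, as in items 2 and 3, some clusters may receive no points in the first assignment step, which is why the empty-cluster rule matters for the reported centers.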
Problem 2
(text clustering, 50pts)
Dataset: https://www.kaggle.com/noushad24/amazon-reviews/download
1. (15pts) Let us focus on the reviews from the dataset without their labels. Build a weight matrix over
all the words in the review text, representing each review as a real-valued vector of tf-idf weights.
Report each preprocessing step you applied and attach the corresponding part of your code. Visualize
the matrix with a color code (i.e., show the matrix as a 2D image where pixel intensity represents the
weight).
2. (15pts) Pick your own 5 “positive” words and 5 “negative” words, which indicate if a product is good
or bad, respectively. List the words you selected. Represent each review in the vector space of these
ten words, both as a count matrix and as a tf-idf weight matrix.
3. (20pts) For each review, sum up the frequencies of the “positive” words and of the “negative” words,
so that each review is represented as a vector of length 2. The reviews can now be shown in 2D space,
where one dimension is the “positive” count and the other is the “negative” count. Apply your code
from Problem 1 to this 2D data with k = 2, 3, 4 and randomly initialized centers. In your report,
report the centers found for each cluster. How many iterations did it take? Show a scatter plot of the
data and the centers of clusters found.
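The representations in items 2 and 3 can be sketched in a few lines of standard-library Python. Everything specific here is an assumption made for illustration: the ten words, the regular-expression tokenizer, the tf-idf variant (raw count times log(N / df)), and the three made-up reviews, which are not taken from the dataset.

```python
import math
import re
from collections import Counter

POSITIVE = ["good", "great", "love", "excellent", "perfect"]  # hypothetical word choices
NEGATIVE = ["bad", "poor", "broken", "terrible", "waste"]     # hypothetical word choices
VOCAB = POSITIVE + NEGATIVE

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def count_matrix(reviews, vocab=VOCAB):
    """One row per review, one column per vocabulary word (raw counts)."""
    rows = []
    for review in reviews:
        counts = Counter(tokenize(review))
        rows.append([counts[w] for w in vocab])
    return rows

def tfidf_matrix(counts):
    """Weight counts by idf = log(N / df); words that never occur get weight 0."""
    n_docs = len(counts)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[tf * w for tf, w in zip(row, idf)] for row in counts]

def pos_neg_vectors(counts, n_pos=len(POSITIVE)):
    """Collapse each row to (total positive count, total negative count)."""
    return [(sum(row[:n_pos]), sum(row[n_pos:])) for row in counts]

reviews = [
    "Great phone, I love it. Great battery.",
    "Terrible quality, broken after a week. Waste of money.",
    "Good price but poor packaging.",
]  # made-up reviews, for illustration only
counts = count_matrix(reviews)
weights = tfidf_matrix(counts)
points = pos_neg_vectors(counts)  # 2D points suitable as input to k-means
```

Other tf-idf conventions (smoothed idf, normalized term frequency, row normalization) are equally valid; whichever is used should be stated in the report.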