Sale!

Homework 4 CS 540 solution

$30.00 $25.50

Category:

Description

5/5 - (7 votes)

Assignment Goals
• Process the real-world data
• Implement hierarchical clustering
• Visualize the clustering process
You are to perform hierarchical clustering on publicly available Pokemon stats. Each Pokemon is defined by a row in the data set. Because there are various ways to characterize
how strong a Pokemon is, we summarize the stats into a shorter feature vector. For this
assignment, you must represent a Pokemon’s quality by 6 numbers:
HP, Attack, Defense, Sp. Atk, Sp. Def, Speed
After each Pokemon is represented as a 6-dimensional feature vector (x1, . . . , x6), you need to
cluster the first n Pokemon with hierarchical agglomerate clustering (HAC). Your function
should work similarly to scipy.cluster.hierarchy.linkage() with method=’complete’.
Program Overview
The data in CSV format can be found in the file Pokemon.csv. Note, there is no starter code
for this assignment. If you feel stuck, refer to the starter code from past assignments. You
will have to write a few python functions for this assignment. Here is a high level description
of each for reference:
1. load_data(filepath) — takes in a string with a path to a CSV file, and returns the
data points as a list of dicts. Section 0.1
2. calc_features(row) — takes in one row dict from the data loaded from the previous
function then calculates the corresponding feature vector for that Pokemon as specified
above, and returns it as a numpy array of shape (6,). The dtype of this array should
be int64. Section 0.2
3. hac(features) — performs complete linkage hierarchical agglomerate clustering on
the Pokemon with the (x1, . . . , x6) feature representation, and returns a numpy array
representing the clustering. Section 0.3
1 Homework 4 CS 540
4. imshow_hac(Z) — visualizes the hierarchical agglomerate clustering on the Pokemon’s
feature representation. Section 0.4
You may implement other helper functions as necessary, but these are the functions we are
testing. In particular, your final python file is just a suite of functions, you should not have
code that runs outside of the functions. To test your code, you may want a “main” method
to put it all together. Make sure, you either delete any testing code running outside functions
or wrap it in a if __name__==”__main__”:. This is discussed more in Section 0.5.
Program Details
0.1 load_data(filepath)
Summary. [20pts]
• Input: string, the path to a file to be read.
• Output: list; each element of the list is a dict representing one row of the file
read; the key of dict is a string(such as ‘Attack’) and the value of the string is a
string(such as ‘111’).
Details.
1. Read in the file specified in the argument, filepath. Note, the DictReader from
Python’s csv module (https://docs.python.org/3/library/csv.html#csv.DictReader) is
useful but, depending on your python version, this might return OrderedDicts instead
of normal dicts. Make sure you convert to dict as appropriate if you choose to use
this function.
2. Return a list of dictionaries, where each row in the dataset is a dictionary with
the column headers as keys and the row elements as values. The keys are strings:
‘HP’, ‘Attack’, ‘Defense’, ‘Sp. Atk’, ‘Sp. Def’, ‘Speed’.
You may assume the file exists and is a properly formatted CSV.
0.2 calc_features(row)
Summary. [20pts]
• Input: dict representing one Pokemon.
• Output: numpy array of shape (6,) and dtype int64. The first element is x1 and so
on with the sixth element being x6.
2 Homework 4 CS 540
Details. This function takes as input the dict representing one Pokemon, and computes
the feature representation (x1, . . . , x6). Specifically,
1. x1 = Attack
2. x2 = Sp. Attack
3. x3 = Speed
4. x4 = Defense
5. x5 = Sp. Def
6. x6 = HP
Note, these stats in the dict would be be string. Make sure to convert each relevant stat
to int when computing each xi
. Return a numpy array having each xi
in order: x1, . . . , x6.
The shape of this array should be (6,). The dtype of this array should be int64. Remember,
this function works for only one Pokemon at a time, not all of the ones that you loaded in
load_data simultaneously. Make sure you are outputting the exact data structures with
appropriate types as specified or you risk a major reduction in points.
0.3 hac(features)
Summary. [50pts]
• Input: list of numpy arrays of shape (6,), where each array is an (x1, . . . , x6) feature
representation as computed in Section 0.2. The total number of feature vectors, i.e.
the length of the input list, is n. Note, we test your code on different n’s as stated in
Section 0.5).
• Output: numpy array Z of the shape (n−1)×4. For any i, Z[i, 0] and Z[i, 1] represent
the indices of the two clusters that were merged in the ith iteration of the clustering
algorithm. Then, Z[i, 2] = d(Z[i, 0], Z[i, 1]) is the complete linkage distance between
the two clusters that were merged in the ith iteration (this will be a real value, not
integer like the other quantities). Lastly, Z[i, 3] is the size of the new cluster formed by
the merge, i.e. the total number of Pokemon in this cluster. Note, the original Pokemon
are considered clusters indexed by 0, . . . , n − 1, and the cluster constructed in the ith
iteration (i ∈ {0, 1, 2, …, n − 2}) of the algorithm has cluster index (n − 1) + (i + 1).
Also, there is a tie-breaking rule specified below that must be followed.
Details. For this function, we would like you to mimic the behavior of SciPy’s HAC function, linkage(). You may not use this function in your implementation, but we strongly
recommend using it to verify your results! This is how you can test your code.
3 Homework 4 CS 540
Distance. Using complete linkage, perform the hierarchical agglomerate clustering algorithm as detailed in lecture. Use the standard Euclidean distance function for calculating
the distance between two points. You may implement your own distance function or use
numpy.linalg.norm(). Other distance functions might not work as expected so check it
works on the CSL machines first! You are liable for any reductions in points you might get
for using a package distance function.
Outline. Here is one possible path you could follow to implement hac()
1. Number each of your starting data points from 0 to n − 1. These are their original
cluster numbers.
2. Create an (n − 1) × 4 array or list. Iterate through this array/list row by row. For
each row,
(a) Determine which two clusters you should merge and put their numbers into the
first and second elements of the row, Z[i, 0] and Z[i, 1] and Z[i, 0] < Z[i, 1].
(b) The complete-linkage distance between the two clusters goes into the third element
of the row, Z[i, 2]
(c) The total number of Pokemon in the cluster goes into the fourth element, Z[i, 3]
The index of a cluster with more than one Pokemon is given by:
(n − 1) + (i + 1) = n + the row index in which the cluster was created.
3. Before returning the data structure, convert it into a NumPy array if it isn’t one
already.
Note, it may be convenient to maintain a distance matrix throughout the process to avoid
having to recalculate the distances between points or clusters. This is especially important
if your code is slow for small n, such as n ≤ 20.
Tie Breaking. When choosing the next two clusters to merge, we pick the pair having the
smallest complete-linkage distance. In the case that multiple pairs have the same distance,
we need additional criteria to pick between them. We do this with a tie-breaking rule on
indices as follows: Suppose (i1, j1), . . . ,(ih, jh) are pairs of cluster indices with equal distance,
i.e., d(i1, j1) = · · · = d(ih, jh), and assume that it < jt
for all t (so each pair is sorted). We
tie-break by picking the pair with the smallest first index, i. If there are multiple pairs having
first index i, we need to further distinguish between them. Say these pairs are (i, t1),(i, t2), . . .
and so on. To tie-break between these pairs, we pick the pair with the smallest second index,
i.e., the smallest t value in these pairs. Be aware that this tie-breaking strategy may not
produce identical results to linkage().
4 Homework 4 CS 540
0.4 imshow_hac(Z)
Summary. [10pts]
• Input: numpy array Z output from hac.
• Output: None, simply plt.show() a graph that visualizes the hierarchical clustering.
You should use dendrogram in the scipy module.
Here are some examples of successful visualizations for different sized lists of Pokemon:
0.5 Testing
To test your code, try running the following line in a main method or in a jupyter notebook:
hac([calc_features(row) for row in load_data(‘Pokemon.csv’)][:n])
for various choices of n. We do not test on very large values of n so looking at n ≤ 30 should
be sufficient. You can then compare your clustering to what linkage() would give you
(remember, set method = ‘complete’), and look at the different clustering visualizations.
Submission Details
• Please submit your files in a zip file named hw4_.zip
• Inside your zip file, there should be only one file named: hw4.py
• Be sure to remove all debugging output before submission.
5 Homework 4 CS 540