COSC 3337 Week 10 Lab (K Means) solved

About The Data
We’ll be using the Customer Dataset from Kaggle for this lab, but feel free to follow along with your own dataset. The dataset
contains the following attributes:
CustomerID
Genre
Age
Annual_Income_(k$)
Spending_Score
Our goal is to group/cluster these customers.
About K Means
K Means clustering is an unsupervised learning algorithm that tries to cluster data based on similarity. Unsupervised learning
means that there is no outcome to be predicted; the algorithm just tries to find patterns in the data. In K Means clustering, we
have to specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to
a cluster and finds the centroid of each cluster. Then the algorithm iterates through two steps:
Reassign data points to the cluster whose centroid is closest.
Calculate the new centroid of each cluster.
These two steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation
(WCSS) is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster
centroids. Refer back to the lecture video or slides for more detail on K Means.
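To make that two-step loop concrete, here is a minimal NumPy sketch of the idea (our own illustration, not the sklearn implementation; the function name kmeans_sketch is hypothetical, and it skips details like handling empty clusters or multiple random restarts):

import numpy as np

def kmeans_sketch(points, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen observations as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to the cluster with the closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points.
        # (Note: a cluster that loses all its points would produce NaNs here.)
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # assignments have stabilized
            break
        centroids = new_centroids
    # Within-cluster variation (WCSS): sum of squared distances to the centroids.
    wcss = ((points - centroids[labels]) ** 2).sum()
    return labels, centroids, wcss

This is the quantity sklearn exposes as the model's inertia_, which we'll use for the elbow method below.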
Implementation
Because K Means is used more for finding patterns in our data, we’ll skip the data exploration portion, but you’re welcome to
explore this data or your own if working with a different dataset.
Let’s first load the data into a pandas DataFrame. We’ll use the CustomerID column as our index_col for this DataFrame.
Genre Age Annual_Income_(k$) Spending_Score
CustomerID
1 Male 19 15 39
2 Male 21 15 81
3 Female 20 16 6
4 Female 23 16 77
5 Female 31 17 40
Calling .info(), we see that there are no missing values in this dataset, since there are 200 entries in total and 200 non-null entries
in each column.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 1 to 200
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Genre               200 non-null    object
 1   Age                 200 non-null    int64
 2   Annual_Income_(k$)  200 non-null    int64
 3   Spending_Score      200 non-null    int64
dtypes: int64(3), object(1)
memory usage: 7.8+ KB
Calling .describe() gives summary statistics for the numeric columns:
Age Annual_Income_(k$) Spending_Score
count 200.000000 200.000000 200.000000
mean 38.850000 60.560000 50.200000
std 13.969007 26.264721 25.823522
min 18.000000 15.000000 1.000000
25% 28.750000 41.500000 34.750000
50% 36.000000 61.500000 50.000000
75% 49.000000 78.000000 73.000000
max 70.000000 137.000000 99.000000
To ensure that we don’t have any duplicates, we can call .drop_duplicates(inplace=True) on our DataFrame.
Just so that we can visualize our clusters at the end of this lab, we’ll go ahead and work with only 2 variables (spending score and
annual income). However, you’re free to use more than 2 variables if you’re working with your own dataset.
We’ll now use the elbow method to find the optimal number of clusters: we fit K Means for k = 1 through 10, record the WCSS for
each model, and look for the “elbow” where adding more clusters stops reducing WCSS substantially.
We’re now ready to create our KMeans model and run our predictions on the X vector we created earlier with spending score and
income.
Note: You won’t typically be plotting the clusters to visualize since you’ll usually have more than 2 variables, but since we only
worked with 2 variables, let’s go ahead and visualize our clusters.
Congrats! You now know how to use KMeans in sklearn. Try repeating the lab steps on your own data for practice. Since we
don’t have the ground truth to compare against (this is unsupervised learning), there’s not much more we can do here to
evaluate our model the way we’re used to. You’ll later learn about Silhouette analysis, which will come in handy.
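As a small preview, scikit-learn ships a silhouette_score function you could already try. A minimal sketch, assuming the X and y_pred variables defined in the cells below have been created:

from sklearn.metrics import silhouette_score

# Scores range from -1 to 1; values closer to 1 mean points sit well
# inside their own cluster and far from neighboring clusters.
score = silhouette_score(X, y_pred)
print(f'Silhouette score for k=5: {score:.3f}')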
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
from matplotlib import rcParams
rcParams['figure.figsize'] = 15, 5
sns.set_style('darkgrid')
In [3]:
customer_df = pd.read_csv('customers.csv', index_col='CustomerID')
customer_df.head()

In [4]:
customer_df.info()
In [5]:
customer_df.describe()

In [6]:
customer_df.drop_duplicates(inplace=True)
In [12]:
# Saving only Spending_Score and income values into X.
X = customer_df.iloc[:, [2, 3]].values
In [15]:
from sklearn.cluster import KMeans

# where we'll store all of the wcss values for plotting later.
wcss = []
for i in range(1, 11):
    # random_state just to ensure we get the same values in the end.
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(X)
    # the inertia_ attribute holds the wcss for that model.
    wcss.append(kmeans.inertia_)

# creating a lineplot to visualize wcss and find the optimal number of clusters
sns.lineplot(x=range(1, 11), y=wcss, marker='o', color='red')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
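Reading the plot, WCSS drops sharply at first and then levels off; the bend (the "elbow") sits around five clusters, which is why we pass n_clusters = 5 in the next cell.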
In [16]:
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_pred = kmeans.fit_predict(X)
In [19]:
sns.scatterplot(x=X[y_pred == 0, 0], y=X[y_pred == 0, 1], color='yellow', label='Cluster 1', s=50)
sns.scatterplot(x=X[y_pred == 1, 0], y=X[y_pred == 1, 1], color='blue', label='Cluster 2', s=50)
sns.scatterplot(x=X[y_pred == 2, 0], y=X[y_pred == 2, 1], color='green', label='Cluster 3', s=50)
sns.scatterplot(x=X[y_pred == 3, 0], y=X[y_pred == 3, 1], color='grey', label='Cluster 4', s=50)
sns.scatterplot(x=X[y_pred == 4, 0], y=X[y_pred == 4, 1], color='orange', label='Cluster 5', s=50)
sns.scatterplot(x=kmeans.cluster_centers_[:, 0], y=kmeans.cluster_centers_[:, 1], color='red',
                label='Centroids', s=100, marker=',')
plt.title('Clusters of customers')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.show()