K-Means is an unsupervised machine learning algorithm and one of the most popular algorithms for clustering. It is used to analyze an unlabeled dataset characterized by features, in order to group “similar” data points into k groups (clusters).
For example, K-Means can be used for behavioral segmentation, anomaly detection, insurance claims, fraud detection, market pricing, or computer vision.
How do you define similarity in the data? The similarity between two data points can be measured by the distance between their features.
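As a small illustration of that idea, the sketch below computes the Euclidean distance between two data points. The points `a` and `b` are made-up values chosen for this example, not taken from any dataset in this article.

```python
import numpy as np

# Two hypothetical data points, each described by two features.
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: square root of the summed squared feature differences.
distance = np.sqrt(np.sum((a - b) ** 2))
print(distance)  # 5.0
```

The smaller the distance, the more similar the two points are considered to be.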
The algorithm starts by scaling the data and initializing k centroids (the clusters’ centers). Then, it assigns each data point to its closest centroid. Next, it computes the average of the points assigned to each centroid to determine the new centroid. Each centroid defines a cluster.
- Step 1: Choose k random points as the initial centroids.
Then, the algorithm iterates between the following two steps until convergence, that is, until the centroids stop moving.
- Step 2: Assign each data point to its nearest centroid by calculating the Euclidean distance.
- Step 3: Determine the new cluster centers by computing the average of the points assigned to each cluster.
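The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than how Sklearn implements K-Means: the function name `kmeans`, its parameters `n_iter` and `seed`, and the toy data are all assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged when the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data (hypothetical): two well-separated groups of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(X, k=2)
print(labels)
print(centroids)
```

On this toy data, the first three points should end up in one cluster and the last three in the other. Which cluster gets the label 0 depends on the random initialization.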
How do you choose k? — The Elbow Method.
The right number of clusters depends on the data and is not always obvious, so you have to choose it yourself.
If k is too small, the clusters won’t represent all of the different categories. If it is too large, the algorithm will create unnecessary clusters and can overfit.
One method for choosing k is the Elbow Method. It consists of plotting the explained variation as a function of the number of clusters, then picking the elbow of the curve (the point after which adding clusters yields little improvement) as k.
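To make “explained variation” concrete, here is a hedged sketch. It assumes one common reading of the term: one minus the ratio of the within-cluster sum of squares (Sklearn's `inertia_`) to the total sum of squares. The synthetic data from `make_blobs` and the chosen cluster centers are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustrative only).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

# Total sum of squares: squared distances from every point to the global mean.
total_ss = np.sum((X - X.mean(axis=0)) ** 2)

explained = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squares for this k.
    explained.append(1 - model.inertia_ / total_ss)

for k, ratio in zip(range(1, 7), explained):
    print(f"k={k}: explained variation = {ratio:.3f}")
```

With data like this, the ratio should jump sharply up to k=3 and then flatten out. That flattening point is the elbow.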
Example of K-Means with Sklearn (iris dataset)
### Import the relevant libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets

### Load the dataset
iris = datasets.load_iris()

### Scale your data and define your features and your targets
X = scale(iris.data)  # Scaled features
y = pd.DataFrame(iris.target)  # Target
variable_names = iris.feature_names

### How does the data look?
pd.DataFrame(data=np.c_[X, y],
             columns=variable_names + ['target']).head()
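### We calculate the distortions for each value of k
### Distortion is the sum of squared distances from each point to its assigned center.
distortions = []
K = range(1, 10)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(X)
    # inertia_ holds the sum of squared distances to the closest centroid
    distortions.append(kmeanModel.inertia_)
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()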
### K-Means clustering with k=3
clustering = KMeans(n_clusters=3, random_state=5)
clustering.fit(X)
iris_df = pd.DataFrame(iris.data)
iris_df.columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']
y.columns = ['Targets']

### Left: points colored by the true species
plt.subplot(1, 2, 1)
plt.scatter(x=iris_df.Petal_Length, y=iris_df.Petal_Width, c=iris.target, s=50)
plt.title('True Classification')

### Right: points colored by the cluster K-Means assigned
### (cluster numbers are arbitrary, so colors may not match the left plot)
plt.subplot(1, 2, 2)
plt.scatter(x=iris_df.Petal_Length, y=iris_df.Petal_Width, c=clustering.labels_, s=50)
plt.title('K-Means Classification')
plt.show()
K-Means clustering is one of the most popular clustering algorithms and gives you a good idea of the structure of the dataset.
However, it doesn’t learn the number of clusters from the data and requires it to be pre-defined, which can be tricky.
If you have further questions, you can contact me on LinkedIn.