# K-Means Algorithm (with example)

Introduction

K-Means is an unsupervised machine learning algorithm and one of the most popular algorithms for clustering. It is used to analyze an unlabeled dataset characterized by features, in order to group “similar” data points into k groups (clusters).

For example, K-Means can be used for behavioral segmentation, anomaly detection, insurance claim analysis, fraud detection, market price analysis, and computer vision.

The algorithm

How do you define similarity in the data? The similarity between two data points can be measured by the distance between their features.
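As a minimal sketch of that idea, the snippet below computes the Euclidean distance between two feature vectors in plain Python. The function name `euclidean_distance` and the two sample points are illustrative assumptions, not part of any library.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical data points, each described by two features.
p = (1.0, 2.0)
q = (4.0, 6.0)

d = euclidean_distance(p, q)
print(d)  # sqrt(3**2 + 4**2) = 5.0
```

The smaller the distance, the more similar the two points are considered to be.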

The algorithm starts by scaling the data and initializing k centroids (cluster centers). It then assigns each data point to its closest centroid and recomputes each centroid as the average of the points assigned to it. Each centroid defines a cluster.
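Scaling matters because a feature measured in large units would otherwise dominate the distance. One common way to scale is standardization (zero mean, unit variance per feature). The sketch below shows it for a single feature column; the helper name `standardize` and the sample values are hypothetical, and it assumes the column is not constant (otherwise the standard deviation is zero).

```python
import statistics

def standardize(column):
    """Rescale one feature column to zero mean and unit (population) variance."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population standard deviation
    return [(x - mean) / std for x in column]

# Hypothetical raw values of one feature.
raw = [2.0, 4.0, 6.0]
scaled = standardize(raw)
print(scaled)
```

After this step, every feature contributes to the distance on a comparable scale.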

1. Step 1: Choose k random points as centroids.

Then, the algorithm iterates between the two following steps until convergence.

2. Step 2: Group each object around its nearest centroid by calculating the Euclidean distance.

3. Step 3: Determine the new cluster center by computing the average of the assigned points.
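The three steps above can be sketched in plain Python as follows. This is a simplified illustration, not how scikit-learn implements K-Means: the function names, the `max_iter` and `seed` parameters, and the toy data are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance (enough for finding the nearest centroid)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: choose k random points as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the average of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            mean_point(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Convergence: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Toy data: two visibly separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Because the starting centroids are random, different seeds can lead to different results on harder datasets, which is why library implementations usually run several initializations.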

How do you choose k? — The Elbow Method.

The number of clusters depends on the data and is not always obvious, so you need to choose it yourself.

If k is too small, the clusters won’t capture all of the distinct groups in the data. If k is too large, the algorithm will split natural groups into unnecessary clusters, which can cause overfitting.

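One method for choosing k is the Elbow Method. It consists of plotting the explained variation (or, as a common alternative, the distortion: the sum of squared distances from each point to its assigned center) as a function of the number of clusters. Then, you pick the elbow of the curve, the point where adding more clusters stops improving the fit substantially, as k.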
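To make the shape of that curve concrete, here is a small sketch that computes the distortion by hand on toy 1-D data. The data `[1, 2, 9, 10]` and the groupings for each k are hand-picked for illustration (they are not produced by running K-Means), and the helper name `distortion` is an assumption.

```python
def distortion(clusters):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for cluster in clusters:
        center = sum(cluster) / len(cluster)
        total += sum((x - center) ** 2 for x in cluster)
    return total

# Hand-picked groupings of the toy data [1, 2, 9, 10] for k = 1, 2, 3.
groupings = {
    1: [[1, 2, 9, 10]],
    2: [[1, 2], [9, 10]],
    3: [[1], [2], [9, 10]],
}

curve = {k: distortion(g) for k, g in groupings.items()}
print(curve)  # {1: 65.0, 2: 1.0, 3: 0.5}
```

The distortion drops sharply from k=1 to k=2 and barely improves afterwards, so the elbow here is at k=2.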

Example of K-Means with Sklearn (iris dataset)

```python
### Import the relevant libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets

### Load the dataset
iris = datasets.load_iris()

### Scale your data and define your features and your targets
X = scale(iris.data)  # Scaled features
y = pd.DataFrame(iris.target)  # Target
variable_names = iris.feature_names

### How does the data look?
pd.DataFrame(data=np.c_[X, y],
             columns=variable_names + ['target']).head()
```

```python
### K-Means clustering with k=3
clustering = KMeans(n_clusters=3, random_state=5)
clustering.fit(X)
```

Elbow Method:

```python
### We calculate the distortions for each value of k
### Distortion is the sum of squared distances from each point to its assigned center.
distortions = []
K = range(1, 10)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(X)
    distortions.append(kmeanModel.inertia_)

### Plot
plt.figure(figsize=(16, 8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
```

```python
### K-Means clustering with k=3
clustering = KMeans(n_clusters=3, random_state=5)
clustering.fit(X)
```

Visualization:

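```python
iris_df = pd.DataFrame(iris.data)
iris_df.columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']
y.columns = ['Targets']

### Left: points colored by their true species
plt.subplot(1, 2, 1)
plt.scatter(x=iris_df.Petal_Length, y=iris_df.Petal_Width, c=iris.target, s=50)
plt.title('True Classification')

### Right: points colored by the cluster K-Means assigned them to.
### Cluster IDs are arbitrary, so the colors may not match the left plot.
plt.subplot(1, 2, 2)
plt.scatter(x=iris_df.Petal_Length, y=iris_df.Petal_Width, c=clustering.labels_, s=50)
plt.title('K-Means Classification')
plt.show()
```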

Conclusion

K-Means clustering is one of the most popular clustering algorithms and gives you a good idea of the structure of the dataset.

However, it doesn’t learn the number of clusters from the data and requires it to be pre-defined, which can be tricky.