# K-Means Algorithm (with example)

**Introduction**

K-Means is an unsupervised machine learning algorithm and one of the most popular algorithms for clustering. It analyzes an unlabeled dataset described by features in order to group "similar" data points into *k* groups (clusters).

For example, K-Means can be used for behavioral segmentation, anomaly detection, insurance claim analysis, fraud detection, market segmentation, or computer vision.

**The algorithm**

How do you define similarity in the data? The similarity between two data points can be measured by the distance between their features.
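As a small illustrative snippet (not part of the article's code), the Euclidean distance between two feature vectors can be computed with NumPy; the sample values below are made up for the example:

```python
import numpy as np

a = np.array([5.1, 3.5, 1.4, 0.2])  # e.g. the four features of one data point
b = np.array([6.2, 2.9, 4.3, 1.3])  # another data point

# Euclidean distance: square root of the sum of squared feature differences
distance = np.sqrt(np.sum((a - b) ** 2))

# np.linalg.norm computes the same quantity
assert np.isclose(distance, np.linalg.norm(a - b))
```

The smaller this distance, the more "similar" the two points are considered by K-Means.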

The algorithm starts by scaling the data and initializing *k* centroids (the clusters' centers). It then assigns each data point to its closest centroid, and recomputes each centroid as the average of its assigned points. Each centroid defines a cluster.

- Step 1: Choose *k* random points as centroids.

Then, the algorithm iterates between the two following steps until convergence:

- Step 2: Assign each data point to its nearest centroid, using the Euclidean distance.

- Step 3: Determine the new cluster centers by computing the average of the assigned points.
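The steps above can be sketched from scratch in plain NumPy. This is a minimal illustration of the algorithm, not the article's sklearn-based code; the function name and its parameters are chosen for the example:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: assign points to nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)
        # Convergence: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

In practice you would use `sklearn.cluster.KMeans` (as in the example below), which adds smarter initialization and multiple restarts.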

**How do you choose k? — The Elbow Method.**

The number of clusters depends on the data and is not always obvious: you have to choose it yourself.

If *k* is too small, the clusters won't capture all of the different categories. If it is too large, the algorithm will create unnecessary clusters and can overfit the data.

One of the methods to choose *k* is the **Elbow Method**. It consists of plotting the explained variation as a function of the number of clusters, then picking the elbow of the curve as *k*.

**Example of K-Means with Sklearn (iris dataset)**

### Import the relevant libraries

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
```

### Load the dataset

```python
iris = datasets.load_iris()
```

### Scale your data and define your features and your targets

```python
X = scale(iris.data)           # Scaled features
y = pd.DataFrame(iris.target)  # Target
variable_names = iris.feature_names
```

### How does the data look?

```python
pd.DataFrame(data=np.c_[X, y], columns=variable_names + ['target']).head()
```

### K-Means clustering with k=3

```python
clustering = KMeans(n_clusters=3, random_state=5)
clustering.fit(X)
```

Elbow Method:

```python
# We calculate the distortion for each value of k.
# Distortion is the sum of squared distances from each point to its assigned center.
distortions = []
K = range(1, 10)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(X)
    distortions.append(kmeanModel.inertia_)

# Plot
plt.figure(figsize=(16, 8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
```


Visualization:

```python
iris_df = pd.DataFrame(iris.data)
iris_df.columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']
y.columns = ['Targets']

# Left: the true classes; right: the clusters found by K-Means.
plt.subplot(1, 2, 1)
plt.scatter(x=iris_df.Petal_Length, y=iris_df.Petal_Width, c=iris.target, s=50)
plt.title('True Classification')

plt.subplot(1, 2, 2)
plt.scatter(x=iris_df.Petal_Length, y=iris_df.Petal_Width, c=clustering.labels_, s=50)
plt.title('K-Means Classification')
plt.show()
```

**Conclusion**

K-Means clustering is one of the most popular clustering algorithms and gives you a good idea of the structure of the dataset.

However, it doesn’t learn the number of clusters from the data and requires it to be pre-defined, which can be tricky.

If you have further questions, you can contact me on LinkedIn.