• Sai Manoj


In this post we will learn about clustering, understand the role of clusters and their importance in analytics, and learn to build clusters using the sklearn library in Python.

What is Clustering?

Clustering is one of the most frequently used analytics applications. It helps data scientists create homogeneous groups of entities so that those entities can be managed better. Clustering algorithms are unsupervised learning algorithms, whereas classification algorithms are supervised. Another important difference is that clustering is a descriptive analysis, whereas classification is usually predictive.

One example of clustering is product segmentation. I will discuss a case in this section where a company would like to enter the market with a new beer brand. Before it decides the kind of beer it will launch, it must understand what kinds of products already exist in the market and what segments those products address. To understand the segments, the company collects the specifications of a few samples of beer brands.

In this section we will use the dataset beer.csv of beer brands and their corresponding features such as calories, sodium, alcohol and cost.

Each observation belongs to a beer brand and it contains information about the calories, alcohol and sodium content, and the cost.

Import required libraries:

Beer Dataset:
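The loading step might look like the sketch below. It assumes beer.csv sits in the working directory. The helper name load_beer is hypothetical, and the column names are taken from the feature list above, so adjust them if the headers in your file are spelled differently.

```python
import pandas as pd

# Feature columns named in the post; the exact header spelling in
# beer.csv is an assumption.
FEATURES = ["calories", "sodium", "alcohol", "cost"]


def load_beer(path="beer.csv"):
    """Read the beer data and check that the expected feature columns exist."""
    beer_df = pd.read_csv(path)
    missing = [col for col in FEATURES if col not in beer_df.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return beer_df
```

Calling beer_df = load_beer() and then beer_df.head() shows the first few brands with their features.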

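The first step in creating clusters is to normalize the features in the beer dataset. The features are measured on different scales, and a distance-based method would otherwise be dominated by the feature with the largest values.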

Techniques used for finding clusters:

1. Using Dendrogram

A dendrogram is a cluster tree diagram that groups together the entities that are nearer to each other. It is drawn using the clustermap() method of the seaborn library.

The dendrogram reorders the observations based on how close they are to each other, using Euclidean distance. The tree on the left of the dendrogram depicts the relative distance between nodes.
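A minimal sketch of the clustermap() call is shown here. The function name plot_beer_clustermap and the styling arguments (cmap, linewidths, figsize) are my own choices, not something fixed by the post. It expects a DataFrame of already scaled features with one row per brand.

```python
import pandas as pd
import seaborn as sns


def plot_beer_clustermap(scaled_df):
    """Draw a clustered heatmap with dendrograms; rows are observations."""
    # clustermap uses Euclidean distance by default (metric="euclidean").
    grid = sns.clustermap(scaled_df, cmap="viridis", linewidths=0.2, figsize=(8, 8))
    return grid
```

The returned grid keeps the row order chosen by the dendrogram in grid.dendrogram_row.reordered_ind, which is handy for checking which brands ended up next to each other.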

What is Euclidean Distance?

It is the straight-line distance between two observations: the square root of the sum of the squared differences between them. If there are many attributes, the squared differences across all the attributes are added up before taking the square root.
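To make the definition concrete, here is a small sketch with NumPy. The helper name euclidean_distance and the two sample points are made up for illustration and are not taken from the beer data.

```python
import numpy as np


def euclidean_distance(a, b):
    """Square root of the summed squared differences across all attributes."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))


# Two made-up observations with two attributes each (not real beer data).
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

The same function works for any number of attributes, as long as both observations have the same length.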

Dendrogram of the beer dataset:

Brands 3 and 11 seem to be the most different, as the distance between them is the highest. They are represented on the two extremes of the dendrogram.

The brand Kingfisher Ult seems to have very low alcohol content. This could be a data error, in which case it can be dropped from the dataset.

2. Using Elbow Curve Method:

If we assume all the products belong to only one segment, then the variance of the cluster will be highest. As we increase the number of clusters, the total variance of all clusters starts to reduce, and it becomes zero if we assume each product is a cluster by itself.

So the Elbow Curve Method considers the percentage of variance explained as a function of the number of clusters.

For a set of records (X1, X2, ..., Xn), where each observation is a d-dimensional real vector, the K-means clustering algorithm segments the observations into k sets S = {S1, S2, ..., Sk} so as to minimize the within-cluster sum of squares (WCSS), that is, the sum of the squared distances between each observation and the centroid of its cluster.
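The WCSS can also be computed by hand, which may help to see what K-means is minimizing. The sketch below uses synthetic data from make_blobs as a stand-in (it is not the beer dataset), and the helper wcss is a hypothetical name. It compares the hand-computed value with the inertia_ attribute that sklearn reports.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data: 60 points, 4 features, 3 groups (not beer.csv).
X, _ = make_blobs(n_samples=60, n_features=4, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)


def wcss(X, labels, centers):
    """Sum of squared distances from each point to its cluster centroid."""
    return float(sum(np.sum((X[labels == k] - c) ** 2) for k, c in enumerate(centers)))


manual = wcss(X, kmeans.labels_, kmeans.cluster_centers_)
# The two values should agree up to floating point rounding.
print(manual, kmeans.inertia_)
```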

The "inertia_" attribute of a fitted K-Means model provides the total variance (the within-cluster sum of squares) for a particular number of clusters.

The code below iterates over cluster counts from 1 to 10 and captures the total variance for each in the variable cluster_errors.
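Since the original snippet is not reproduced here, the loop could be sketched as follows. The array scaled_features is a synthetic stand-in generated with make_blobs (an assumption, not the scaled beer data), and n_init and random_state are set only to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled beer features (assumption, not beer.csv).
scaled_features, _ = make_blobs(n_samples=60, n_features=4, centers=3, random_state=42)

cluster_range = range(1, 11)  # 1 to 10 clusters, as described in the post
cluster_errors = []

for num_clusters in cluster_range:
    model = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
    model.fit(scaled_features)
    # inertia_ is the within-cluster sum of squares for this value of k
    cluster_errors.append(model.inertia_)
```

With the real data, scaled_features would be replaced by the standardized calories, sodium, alcohol and cost columns.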

The cluster_errors is plotted against the number of clusters.

X-axis = number of clusters, Y-axis = sum of squares of error.

In the above plot we can see the elbow point is at 3, which indicates there might be three clusters in the dataset.
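A plot with these axes could be drawn with matplotlib roughly as below. The function name plot_elbow and the figure size are my own choices. It takes the range of cluster counts and the matching list of errors.

```python
import matplotlib.pyplot as plt


def plot_elbow(cluster_range, cluster_errors):
    """Plot the error for each number of clusters and return the Axes."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(list(cluster_range), cluster_errors, marker="o")
    ax.set_xlabel("Number of clusters")
    ax.set_ylabel("Sum of squares of error")
    return ax
```

Calling plot_elbow(range(1, 11), cluster_errors) followed by plt.show() displays the curve, and the bend in the line is the elbow point.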

Normalizing the Features:

Standardize the features using "StandardScaler", which rescales each feature to zero mean and unit variance.
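As a rough sketch of this step, the code below standardizes four columns named after the beer features. The rows are randomly generated placeholders on made-up scales, used only to show the mechanics, and they are not values from beer.csv.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

FEATURES = ["calories", "sodium", "alcohol", "cost"]

# Randomly generated placeholder rows on made-up scales (not real beer data).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    rng.normal(loc=[130.0, 15.0, 4.5, 0.5], scale=[30.0, 6.0, 1.0, 0.15], size=(20, 4)),
    columns=FEATURES,
)

scaler = StandardScaler()
# fit_transform returns a NumPy array with one standardized column per feature
scaled_features = scaler.fit_transform(demo_df[FEATURES])
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the distance calculation.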

Creating Clusters

K-Means Clustering:

K-Means clustering is a non-hierarchical clustering method in which the number of clusters (K) is decided a priori. The observations in the sample are assigned to one of the clusters, say (C1, C2, ..., Ck), based on the distance between the observation and the centroid of each cluster.

The sklearn library provides the KMeans algorithm.

Set k=3 for running the KMeans algorithm and create a new column, clusterid, to hold the cluster number each observation is assigned to.
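A hedged sketch of this step follows. The DataFrame demo_df is filled with synthetic make_blobs data under the beer feature names (an assumption standing in for beer.csv), and n_init and random_state are set only for repeatability.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

FEATURES = ["calories", "sodium", "alcohol", "cost"]

# Synthetic stand-in for the beer data (assumption, not beer.csv).
raw, _ = make_blobs(n_samples=30, n_features=4, centers=3, random_state=7)
demo_df = pd.DataFrame(raw, columns=FEATURES)

scaled = StandardScaler().fit_transform(demo_df[FEATURES])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
# fit_predict returns one cluster label (0, 1 or 2) per observation
demo_df["clusterid"] = kmeans.fit_predict(scaled)

# Average feature values per cluster help to interpret each segment.
print(demo_df.groupby("clusterid")[FEATURES].mean())
```

Filtering with demo_df[demo_df.clusterid == 0] lists the members of a single cluster, which is one way to read off groups like the ones described next.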


Cluster 0:

In cluster0, beers with medium cost and medium calories are grouped together.

Cluster 1:

In cluster1, all the beers with high calories and high cost are grouped together.

Cluster 2:

In cluster2, light beers are grouped together (i.e. alcohol content is almost 0).

Hierarchical Clustering:

Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.

The AgglomerativeClustering class in sklearn.cluster provides an algorithm for hierarchical clustering and also takes the number of clusters to be created as an argument.

Agglomerative hierarchical clustering can be represented and understood using a dendrogram.

We will create clusters using Agglomerative Clustering and store the new clusters in h_clusterid variable.
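This step might look like the following sketch. As before, demo_df holds synthetic make_blobs data under the beer feature names (an assumption, not the real dataset), and the default ward linkage of AgglomerativeClustering is used because the post does not name one.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

FEATURES = ["calories", "sodium", "alcohol", "cost"]

# Synthetic stand-in for the beer data (assumption, not beer.csv).
raw, _ = make_blobs(n_samples=30, n_features=4, centers=3, random_state=7)
demo_df = pd.DataFrame(raw, columns=FEATURES)

scaled = StandardScaler().fit_transform(demo_df[FEATURES])

# Three clusters, to match the k used for K-Means in the post.
h_clusters = AgglomerativeClustering(n_clusters=3)
demo_df["h_clusterid"] = h_clusters.fit_predict(scaled)

# Number of observations in each hierarchical cluster.
print(demo_df["h_clusterid"].value_counts())
```

The label numbers themselves are arbitrary, so cluster 0 from one algorithm need not correspond to cluster 0 from the other.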

Here "clusterid" holds the clusters created using K-Means and "h_clusterid" holds the clusters created using hierarchical clustering.


One of the decisions to be taken during clustering is the number of clusters, which can be chosen using the elbow curve. The cluster number at which the bend occurs in the curve is taken as the optimal number of clusters.

Let’s discuss in comments if you find anything wrong in the post or if you have anything to add.

Credits and Sources - www.statquest.com

Data OilSt.