Cluster analysis

Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)

Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University

1 Learning Objectives

In this lecture, you will be able to

  • Understand what cluster analysis is
  • Use R for cluster analysis

2 Principle

2.1 Definition

Cluster analysis/Clustering:

A simple and intuitive method of data simplification. It splits a number of observations or measured variables into groups that are similar in their characteristics or behaviors.

Cluster:

A group of objects such that the objects in the cluster are closer (more similar) to the ‘centre’ of their cluster than to the centre of any other cluster. The centre of a cluster is usually the centroid, the average of all the points in the cluster, or the most ‘representative’ point in the cluster.
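For numeric data in R, the centroid is simply the variable-wise mean of the cluster members; a tiny illustration (the group chosen here is arbitrary):

# Centroid (variable-wise mean) of an arbitrary group of iris observations
colMeans(iris[1:50, c("Petal.Length", "Petal.Width")])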

2.2 Examples

# Petal length vs. width, coloured by species and without colouring
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species, pch = 16)
plot(iris$Petal.Length, iris$Petal.Width, pch = 16)

library(rgl)  # interactive 3D scatter plots
plot3d(iris$Petal.Length, iris$Petal.Width, iris$Sepal.Length, type = 's', col = as.integer(iris$Species))
plot3d(iris$Petal.Length, iris$Petal.Width, iris$Sepal.Length, type = 's')

library(forecast)  # for BoxCox()
library(cluster)   # for agnes() and pltree()
# Box-Cox-transform the two petal variables, then cluster them hierarchically (AGNES)
iris2trans <- BoxCox(as.matrix(iris[, c('Petal.Length', 'Petal.Width')]), lambda = "auto")
iris2a <- agnes(iris2trans, metric = "euclidean", method = "average")
pltree(iris2a)  # dendrogram
# Compare 2-cluster and 3-cluster cuts with the true species
par(mfrow = c(1,3))
plot(iris$Petal.Length, iris$Petal.Width, col = cutree(iris2a, k=2), pch = 16)
plot(iris$Petal.Length, iris$Petal.Width, col = cutree(iris2a, k=3), pch = 16)
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species, pch = 16)

# Repeat with three variables
iris3trans <- BoxCox(as.matrix(iris[, c('Petal.Length', 'Petal.Width', 'Sepal.Length')]), lambda = "auto")
iris3a <- agnes(iris3trans, metric = "euclidean", method = "average")
pltree(iris3a)
plot3d(iris$Petal.Length, iris$Petal.Width, iris$Sepal.Length, type = 's', col = cutree(iris3a, k=2))
plot3d(iris$Petal.Length, iris$Petal.Width, iris$Sepal.Length, type = 's', col = cutree(iris3a, k=3))
open3d()  # open a new rgl device for the species-coloured comparison plot
plot3d(iris$Petal.Length, iris$Petal.Width, iris$Sepal.Length, type = 's', col = as.integer(iris$Species))

2.3 Methods

2.3.1 K-means

Centroid models:

Iterative clustering algorithms that repeatedly correct the centroid positions of the clusters and adjust the assignment of each data point.

K-means:

Classifies a given data set into a pre-specified number (k) of clusters. The most common centroid-based clustering algorithm.

Steps:

  1. Specify the number of clusters k.
  2. Assign points to clusters randomly.
  3. Find the centroids of each cluster.
  4. Re-assign points according to their closest centroid.
  5. Re-adjust the positions of the cluster centroids.
  6. Repeat steps 4 and 5 until no further changes occur.

Example of the k-means cluster analysis process
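To make the steps concrete, here is a minimal k-means loop written from scratch in base R. This is a sketch only (it uses two iris variables, names of my own, and does not handle empty clusters); in practice one would call kmeans(), as in the workflow below.

set.seed(1)
X <- as.matrix(iris[, 3:4])                  # Petal.Length and Petal.Width
k <- 3
cl <- sample(1:k, nrow(X), replace = TRUE)   # step 2: random initial assignment
repeat {
  # steps 3/5: compute (or re-adjust) the centroid of each cluster
  centroids <- sapply(1:k, function(j) colMeans(X[cl == j, , drop = FALSE]))
  # step 4: squared Euclidean distance of every point to every centroid
  d <- sapply(1:k, function(j) colSums((t(X) - centroids[, j])^2))
  newCl <- max.col(-d)                       # closest centroid per point
  if (identical(newCl, cl)) break            # step 6: stop when assignments are stable
  cl <- newCl
}
table(sketch = cl, species = iris$Species)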

Advantages:

  1. The fastest centroid-based algorithm.
  2. Works for large data sets.
  3. Reduces intra-cluster variance.

Disadvantages:

  1. Performance suffers when there is more noise in the data.
  2. Outliers cannot be handled properly.
  3. Even though it reduces intra-cluster variance, it cannot guarantee reaching the global minimum of the variance measure.
  4. Very sensitive when clustering data sets with non-convex cluster shapes.

The most commonly used implementation of k-means clustering is one that tries to find the partition of the n individuals into k groups that minimises the within-group sum of squares over all variables.
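In symbols (notation mine), the quantity being minimised is

$$
\mathrm{WGSS} = \sum_{j=1}^{k} \sum_{i \in C_j} \sum_{v=1}^{p} \left( x_{iv} - \bar{x}_{jv} \right)^{2},
$$

where $C_j$ is the $j$-th cluster, $p$ is the number of variables, and $\bar{x}_{jv}$ is the mean of variable $v$ within cluster $C_j$.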

Table 1. Number of possible partitions depending on the sample size n and number of clusters k. It would be impossible to make a complete enumeration of every possible partition, even with the fastest computers.
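The counts in such a table are Stirling numbers of the second kind: the number of ways to partition $n$ observations into $k$ non-empty groups is

$$
S(n, k) = \frac{1}{k!} \sum_{j=0}^{k} (-1)^{j} \binom{k}{j} (k-j)^{n},
$$

which grows explosively with $n$.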

2.3.2 Hierarchical cluster analysis (HCA)

Suitable for small data sets because of its time and space costs.

Advantages:

  • Distance and similarity rules are easy to define and less restrictive.
  • There is no need to specify the number of clusters in advance.
  • It can discover the hierarchy of clusters, shown as a tree (dendrogram).

Disadvantages:

  • The algorithm is complex.
  • It does not scale well, because of the difficulty of choosing merging or splitting points.

Approaches:

  • Agglomerative (AGNES): begins with each observation in its own (singleton) cluster and successively merges clusters until a stopping criterion is satisfied. Bottom-up.
  • Divisive (DIANA): begins with all observations in a single cluster and performs splits until a stopping criterion is met. Top-down.
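Both approaches are implemented in the cluster package, which is used in the workflow below; a minimal sketch on the iris measurements (default metric and linkage):

library(cluster)
ag <- agnes(iris[, 1:4])   # AGNES: agglomerative, bottom-up
di <- diana(iris[, 1:4])   # DIANA: divisive, top-down
pltree(ag)                 # dendrograms for comparison
pltree(di)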

Figure 4.2 The distinction between agglomerative and divisive hierarchical clustering

3 Workflow

3.1 Data and question

  • Same unit: clustering ignores the physical meaning of the variables and focuses only on the numbers.
  • Same dimension (scale): because distances are computed across all variables.
newIris <- iris[1:4] 
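Here all four iris measurements are in centimetres, so no rescaling is needed. If the variables had different units or very different ranges, a common extra step would be to standardize them first, e.g. (object name is arbitrary):

# Optional: standardize each variable to mean 0 and standard deviation 1
newIrisScaled <- scale(newIris)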

3.2 The kmeans method

Compare the total within-cluster sum of squares (WSS) under different values of k to determine the optimal number of clusters.

library(factoextra)
fviz_nbclust(newIris, kmeans, method = "wss")
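fviz_nbclust() wraps this computation; the same “elbow” curve can also be produced directly with base R (a sketch with illustrative settings):

# Total within-cluster sum of squares for k = 1, ..., 10
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(newIris, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")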

Set k = 3 and perform the clustering.

kc <- kmeans(newIris, 3) 
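Because kmeans() starts from randomly chosen centres, repeated runs can label the clusters differently (see the comparison in Section 4). A reproducible variant, with illustrative settings:

set.seed(123)                                    # fix the random start
kc <- kmeans(newIris, centers = 3, nstart = 25)  # nstart: keep the best of 25 random starts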

Create a cross-tabulation of the cluster labels and the original species to check the results.

table(iris$Species,kc$cluster) 
            
              1  2  3
  setosa     17 33  0
  versicolor  4  0 46
  virginica   0  0 50

Visualization.

par(mfrow = c(1, 2))
# Colour = k-means cluster, plotting symbol = true species
plot(newIris[c("Sepal.Length", "Sepal.Width")],
     col = kc$cluster,
     pch = as.integer(iris$Species))
points(kc$centers[, c("Sepal.Length", "Sepal.Width")],  # add the cluster centres
       col = 1:3,
       pch = 16,
       cex = 2)
plot(newIris[c("Petal.Length", "Petal.Width")],
     col = kc$cluster,
     pch = as.integer(iris$Species))
points(kc$centers[, c("Petal.Length", "Petal.Width")],  # centres for the petal variables
       col = 1:3,
       pch = 16,
       cex = 2)
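Optionally, factoextra (already loaded above) can draw the clusters on the first two principal components; a one-line sketch:

# PCA-based cluster plot of the k-means result
fviz_cluster(kc, data = newIris)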

3.3 The HCA method

# Take three observations (rows 2 to 4) of the four measurements as a small worked example
iris3R <- as.matrix(iris[2:4, -5])
row.names(iris3R) <- c("Object1", "Object2", "Object3")
Box-Cox transformation:

A flexible data normalization method that minimizes skewness and is well suited to (applied geochemical and) general environmental data. The data for a Box-Cox transformation must be positive, continuous variables.
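For reference, the Box-Cox transformation of a positive value $y$ with parameter $\lambda$ is defined as

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\
\log y, & \lambda = 0.
\end{cases}
$$

Setting lambda = "auto" asks BoxCox() to estimate $\lambda$ from the data; the estimate is stored in the "lambda" attribute of the result, as shown below.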
library(forecast)
iris.transformed <- BoxCox(iris3R, lambda = "auto")
iris.transformed
        Sepal.Length Sepal.Width Petal.Length Petal.Width
Object1     3.725276    1.941234    0.3967354  -0.8225575
Object2     3.539252    2.131053    0.2981108  -0.8225575
Object3     3.446104    2.036214    0.4950348  -0.8225575
attr(,"lambda")
[1] 0.9538139
# Calculate the distance matrix
distM <- dist(as.matrix(iris.transformed), method = "euclidean")
# dist() uses "euclidean" by default if no method is given. Euclidean distance is a good
# choice for low-dimensional data, but the data usually need to be normalized first.
distM

The distance between Object2 and Object3 is the smallest.

# Perform agglomerative HCA
hc <- hclust(distM, method = "complete")
# hclust() uses "complete" linkage by default if no method is given
# Draw the dendrogram (the cluster solutions grow like a tree)
plot(hc)

Object2 and Object3 are more similar to each other, while Object1 is more distinct.

But if we change the distance measure to “canberra”:

distC <- dist(as.matrix(iris.transformed), method = "canberra")
hc2 <- hclust(distC, method = "complete")
plot(hc2)

Use the agglomerative coefficient (AC, for AGNES) and the divisive coefficient (DC, for DIANA) to compare the strength of the hierarchical clustering structures. The stronger the clustering structure, the higher the coefficient (the closer it is to 1).

library(cluster)
help("coef.hclust")
coef.hclust(hc)
coef.hclust(hc2)

As the “canberra”-based clustering has a higher AC, its clustering structure is stronger.

Now apply HCA to the full iris dataset:

library(forecast)
iris.trans <- BoxCox(as.matrix(iris[,-5]), lambda = "auto")
library("cluster")

# Perform AGNES
irisHcA <- agnes(iris.trans, metric = "euclidean", method = "average")# The default setting
# Cut the AGNES tree at cluster number(k) = 3
hclusterA <- cutree(irisHcA, k = 3)
tb1 <- table(true = iris$Species, cluster = hclusterA)
tb1
            cluster
true          1  2  3
  setosa     50  0  0
  versicolor  0 46  4
  virginica   0 50  0
# Extract the agglomerative coefficient. When using agnes/diana(), it's automatically calculated
irisHcA$ac # k value will not affect the agglomerative coefficients
[1] 0.9378719
# Perform DIANA
irisHcD <- diana(iris.trans, metric = "euclidean")# No linkage method, default setting
# Cut at k = 3
hclusterD <- cutree(irisHcD, k = 3)
tb2 <- table(true = iris$Species, cluster = hclusterD)
tb2
            cluster
true          1  2  3
  setosa     50  0  0
  versicolor  0 46  4
  virginica   0  3 47
#Extract divisive coefficient
irisHcD$dc
[1] 0.9555297

Changing the linkage method can affect the clustering result:

#change the linkage method
irisHcS <- agnes(iris.trans, method = "single")
hclusterS <- cutree(irisHcS, k = 3)
tb3 <- table(true = iris$Species, cluster = hclusterS)
tb3
            cluster
true          1  2  3
  setosa     49  1  0
  versicolor  0  0 50
  virginica   0  0 50
irisHcS$ac
[1] 0.8807232

4 Comparison

k-means clustering vs. hierarchical clustering:

  • k-means assigns records, based on distance, to a pre-specified number of mutually exclusive, roughly spherical clusters. Hierarchical methods can be either agglomerative or divisive.
  • k-means requires advance knowledge of k, the number of clusters into which the data will be divided. In hierarchical clustering, one can stop at any number of clusters found appropriate by interpreting the dendrogram.
  • In k-means, the mean (or median) is used as the cluster centre representing each cluster. Agglomerative methods begin with n clusters and sequentially merge similar clusters until only one cluster remains; divisive methods work in the opposite direction, beginning with one cluster that contains all the records.
  • k-means is normally less computationally intensive and is suited to very large data sets. Hierarchical methods are especially useful when the goal is to arrange the clusters into a natural hierarchy.
  • Because k-means starts from a random choice of clusters, repeated runs of the algorithm may give different results. Hierarchical clustering results are reproducible.
  • k-means is simply a division of the data objects into non-overlapping subsets (clusters) such that each object is in exactly one subset. A hierarchical clustering is a set of nested clusters arranged as a tree.
  • k-means works well when the clusters are hyper-spherical (a circle in 2D, a sphere in 3D). Hierarchical clustering does not work as well as k-means when the clusters are hyper-spherical.
  • Advantages of k-means: convergence is guaranteed; it can be specialized to clusters of different sizes and shapes. Advantages of hierarchical clustering: ease of handling any form of similarity or distance; consequently, applicability to any attribute type.
  • Disadvantages of k-means: the value of k is difficult to predict; it does not work well with global clusters. Disadvantage of hierarchical clustering: it requires the computation and storage of an n × n distance matrix, which can be expensive and slow for very large data sets.

5 Further readings

  • Statistical Data Analysis Explained. Chapter 15.

6 Exercises

Carry out clustering analysis on the “mtcars” dataset.