Cluster analysis is the art of mathematically splitting objects into groups, or clusters, whose members are similar to one another in some way and dissimilar to objects belonging to other groups.
There is a plethora of clustering algorithms available, each with plenty of supporters and detractors of all flavours. I am not going to cover all of them.
This post is about how I normally conduct cluster analyses. I use cluster analysis as a preliminary step for predictive modelling: I create clusters and, when new objects become available, I classify them into those groups using multiple techniques.
When possible, I’ll try to provide different approaches to achieve the same result using R.
For this example I’m going to use the water-treatment data set from UCI.
This dataset comes from the daily measures of sensors in an urban waste water treatment plant. The objective is to classify the operational state of the plant in order to predict faults through the state variables of the plant at each of the stages of the treatment process.
Data Set Description
Number of instances: 527
Number of Attributes: 38
All attributes are numeric and continuous
There are some missing values; all correspond to unknown information.
Comments to the data file:
The first element of each line is the day of the data; the rest are the attribute values.
1 Q-E (input flow to plant)
2 ZN-E (input Zinc to plant)
3 PH-E (input pH to plant)
4 DBO-E (input Biological demand of oxygen to plant)
5 DQO-E (input chemical demand of oxygen to plant)
6 SS-E (input suspended solids to plant)
7 SSV-E (input volatile suspended solids to plant)
8 SED-E (input sediments to plant)
9 COND-E (input conductivity to plant)
10 PH-P (input pH to primary settler)
11 DBO-P (input Biological demand of oxygen to primary settler)
12 SS-P (input suspended solids to primary settler)
13 SSV-P (input volatile suspended solids to primary settler)
14 SED-P (input sediments to primary settler)
15 COND-P (input conductivity to primary settler)
16 PH-D (input pH to secondary settler)
17 DBO-D (input Biological demand of oxygen to secondary settler)
18 DQO-D (input chemical demand of oxygen to secondary settler)
19 SS-D (input suspended solids to secondary settler)
20 SSV-D (input volatile suspended solids to secondary settler)
21 SED-D (input sediments to secondary settler)
22 COND-D (input conductivity to secondary settler)
23 PH-S (output pH)
24 DBO-S (output Biological demand of oxygen)
25 DQO-S (output chemical demand of oxygen)
26 SS-S (output suspended solids)
27 SSV-S (output volatile suspended solids)
28 SED-S (output sediments)
29 COND-S (output conductivity)
30 RD-DBO-P (performance input Biological demand of oxygen in primary settler)
31 RD-SS-P (performance input suspended solids to primary settler)
32 RD-SED-P (performance input sediments to primary settler)
33 RD-DBO-S (performance input Biological demand of oxygen to secondary settler)
34 RD-DQO-S (performance input chemical demand of oxygen to secondary settler)
35 RD-DBO-G (global performance input Biological demand of oxygen)
36 RD-DQO-G (global performance input chemical demand of oxygen)
37 RD-SS-G (global performance input suspended solids)
38 RD-SED-G (global performance input sediments)
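Reading the file is straightforward once you tell R that `?` marks the unknown values. Below is a minimal, self-contained sketch using a mocked-up two-line sample (the values are illustrative, not real measurements, and are shortened to 9 attributes); for the real analysis, point `read.csv()` at your local copy of `water-treatment.data` from the UCI repository.

```r
# Each line of water-treatment.data is: day, then comma-separated attribute
# values, with "?" for unknowns. These two lines are mock examples purely
# to illustrate the parsing.
sample_lines <- c("D-1/3/90,35000,2.0,7.8,?,407,166,66.3,4.5,2110",
                  "D-2/3/90,36000,3.2,7.7,201,443,214,69.2,6.5,2660")
path <- tempfile(fileext = ".data")
writeLines(sample_lines, path)

raw   <- read.csv(path, header = FALSE, na.strings = "?",
                  stringsAsFactors = FALSE)
water <- raw[, -1]   # drop the day; keep the attribute columns
str(water)           # the attribute columns are now numeric, with NAs
```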
Now we have a bunch of missing values, because the unknown `?` entries become NAs when the data are read in, so we need to fill these gaps.
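How you fill the gaps is a modelling choice; a simple, defensible default is median imputation per variable (caret’s `knnImpute`, available via the FastTrack described later, is a more sophisticated alternative). A minimal sketch:

```r
# Replace each NA with the median of its column (numeric columns only).
impute_median <- function(df) {
  for (j in seq_along(df)) {
    x <- df[[j]]
    x[is.na(x)] <- median(x, na.rm = TRUE)
    df[[j]] <- x
  }
  df
}

# Tiny illustration:
toy <- data.frame(a = c(1, NA, 3), b = c(NA, 2, 2))
impute_median(toy)   # the NAs become 2 (the column medians)
```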
It is always advisable to remove variables with very low variability, as they will not contribute to the formation of clusters.
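A quick base-R check for (near-)constant columns is sketched below; caret also offers `nearZeroVar()` for a more nuanced diagnosis. The tolerance is an arbitrary illustration, not a recommended value.

```r
# Names of columns whose variance is essentially zero.
low_var <- function(df, tol = 1e-8) {
  v <- vapply(df, function(x) var(x, na.rm = TRUE), numeric(1))
  names(v)[v < tol]
}

toy <- data.frame(constant = rep(1, 10), varying = 1:10)
low_var(toy)                                            # "constant"
toy <- toy[, setdiff(names(toy), low_var(toy)), drop = FALSE]
```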
Some algorithms, like k-means, are susceptible to outliers, badly skewed distributions, etc. Checking for these issues and fixing or improving skewness where possible is a must-do before applying clustering (especially k-means).
Some variables are badly skewed (positively and negatively). Let’s plot some of them.
We can use Box-Cox transformations to try to fix/improve skewness.
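For one variable, a base-R sketch looks like the following: transform with a candidate λ and pick the λ that maximises the Box-Cox profile log-likelihood (caret’s `BoxCoxTrans()` does the same job with less code). The λ grid and the toy variable are arbitrary illustrations; the transform requires strictly positive values.

```r
# Box-Cox transform for a given lambda (x must be strictly positive).
boxcox_transform <- function(x, lambda) {
  if (abs(lambda) < 1e-6) log(x) else (x^lambda - 1) / lambda
}

# Grid search for the lambda maximising the profile log-likelihood.
best_lambda <- function(x, lambdas = seq(-2, 2, by = 0.1)) {
  n  <- length(x)
  ll <- sapply(lambdas, function(l) {
    y <- boxcox_transform(x, l)
    -n / 2 * log(sum((y - mean(y))^2) / n) + (l - 1) * sum(log(x))
  })
  lambdas[which.max(ll)]
}

set.seed(42)
x  <- rexp(500, rate = 0.5)    # a positively skewed toy variable
l  <- best_lambda(x)
xt <- boxcox_transform(x, l)   # the (less skewed) transformed variable
```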
This is only for one variable, so we need to repeat the above for all other variables with ill skewness. Whilst we could wrap the above piece of code in a function, there is another way, which I call the FastTrack, using the caret package.
The FastTrack uses the preProcess() function in the caret package to apply Box-Cox, centering & scaling, and imputation all in one go. It is pretty neat!
The only thing you need to do before applying the FastTrack approach is to ensure all your variables are continuous.
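A sketch of the FastTrack call, assuming the cleaned data frame is called `water` (mocked below with a small positive, skewed toy frame so the snippet runs on its own). The method names are caret’s own; `medianImpute` is shown here, with `knnImpute` as the k-nearest-neighbour alternative.

```r
library(caret)

# Toy stand-in for the cleaned water data: positive, skewed, with some NAs.
set.seed(1)
water <- data.frame(q  = rexp(50, 0.1),
                    zn = rexp(50, 2),
                    ph = runif(50, 6, 9))
water$q[c(3, 17)] <- NA

# Box-Cox, centering, scaling and imputation in one go.
pp <- preProcess(water,
                 method = c("BoxCox", "center", "scale", "medianImpute"))
water_pp <- predict(pp, water)   # transformed, scaled, gap-free data
```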
I always start with k-means as it is easy to implement and understand. I set k to a high value, run k-means for 2 clusters up to k clusters, and plot the well-known elbow graph to get an idea of how many clusters I will need.
For this example I choose k=50.
Note: You may get a slightly different graph, as k-means always returns different setups due to the initial random selection.
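The elbow run can be sketched as below; `water_pp` stands in for the preprocessed data (mocked here with random clustered points so the snippet is self-contained).

```r
set.seed(1234)
# Stand-in for the preprocessed data: 200 points around 4 random centres.
centres  <- matrix(rnorm(4 * 5, sd = 4), ncol = 5)
water_pp <- centres[sample(4, 200, replace = TRUE), ] +
            matrix(rnorm(200 * 5), ncol = 5)

# Total within-cluster sum of squares for k = 2..50.
ks  <- 2:50
wss <- sapply(ks, function(k)
  kmeans(water_pp, centers = k, nstart = 10, iter.max = 50)$tot.withinss)

plot(ks, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS",
     main = "Elbow graph")
```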
Looking at the above graph, I’d say k is between 3 and 9 clusters.
The initial k-means run suggests 3 <= k <= 9 clusters, and I think 4 or 5 clusters seems right, but I’ll use hierarchical clustering to gain more insight into the number of clusters. Notice that I use other clustering algorithms to inform the choice of k for the initial k-means.
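A hierarchical run with Ward linkage might look like this (the data frame is mocked again so the snippet runs on its own):

```r
set.seed(1234)
# Stand-in for the preprocessed data (see the k-means step).
water_pp <- matrix(rnorm(200 * 5), ncol = 5)
water_pp[1:100, 1] <- water_pp[1:100, 1] + 6   # two loose groups

d  <- dist(water_pp)                    # Euclidean distance matrix
hc <- hclust(d, method = "ward.D2")     # Ward's linkage

plot(hc, labels = FALSE, main = "Dendrogram (Ward linkage)")
rect.hclust(hc, k = 3, border = "red")  # outline a 3-cluster cut
```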
In the above tree graph we are looking for long stems, and it seems like 3, 4 or 5 clusters are strong candidates.
Let’s see what these different cluster options look like.
3 clusters seem like a potential solution. Let’s take a closer look at the 3-cluster solution.
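One quick way to eyeball the 3-cluster solution is to project the data onto its first two principal components and colour the points by cluster (mock data again so the snippet runs standalone):

```r
set.seed(1234)
# Stand-in for the preprocessed data: three loose groups.
water_pp <- rbind(matrix(rnorm(70 * 4), ncol = 4),
                  matrix(rnorm(70 * 4, mean = 4), ncol = 4),
                  matrix(rnorm(70 * 4, mean = -4), ncol = 4))

km3 <- kmeans(water_pp, centers = 3, nstart = 25)
pc  <- prcomp(water_pp)
plot(pc$x[, 1:2], col = km3$cluster, pch = 19,
     main = "k-means with k = 3, first two principal components")
```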
Validating cluster solutions
Another way of validating clusters is with the clValid package. Google it for further information.
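For reference, a sketch of the clValid call (the function and argument names are clValid’s own); `water_pp` again stands in for the preprocessed data:

```r
library(clValid)

set.seed(1234)
water_pp <- matrix(rnorm(100 * 4), ncol = 4)   # stand-in data
rownames(water_pp) <- paste0("day", 1:100)

cv <- clValid(water_pp, nClust = 2:8,
              clMethods  = c("kmeans", "hierarchical"),
              validation = "stability")
summary(cv)          # APN, AD, ADM and FOM per method and k
optimalScores(cv)    # best method/k for each stability measure
```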
Using the stability measures, it seems like 3 and 8 are good options for the number of clusters; however, with 8 clusters there is a lot of overlap among clusters.
So the final solution at this stage will be 3 clusters. One can also isolate each cluster and look for meaningful sub-clusters by repeating the above process.