What is Cluster Analysis?
Definition, History and Benefits
Cluster analysis is a statistical technique for classifying objects into groups, such that the objects within one group are much more similar to each other than to objects belonging to other groups. It is generally used for exploratory data analysis, where it serves as a method of discovery for classification problems in which the groups are not known in advance.
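To make "similar" concrete, one common choice (an assumption here, since the article does not name a measure) is the Euclidean distance between observations. The sketch below uses four invented 2-D points and an assumed grouping to check that distances inside a group are smaller than distances between groups.

```python
import math

# Hypothetical observations, invented purely for illustration.
points = {
    "a": (1.0, 1.0),
    "b": (1.5, 1.2),
    "c": (8.0, 8.0),
    "d": (8.3, 7.6),
}


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


# Assumed grouping for this sketch: {a, b} and {c, d}.
within = [
    euclidean(points["a"], points["b"]),
    euclidean(points["c"], points["d"]),
]
between = [euclidean(points[i], points[j]) for i in "ab" for j in "cd"]

print("largest within-group distance:", round(max(within), 2))
print("smallest between-group distance:", round(min(between), 2))
```

A grouping is a reasonable clustering in this sense when the within-group distances are clearly smaller than the between-group ones. Other distance or similarity measures can be substituted for `euclidean` without changing the idea.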
Cluster Analysis is done using two categories of methods –
Hierarchical Cluster Analysis methods
- Agglomerative methods – each object starts in its own cluster. The most similar clusters are then merged step by step, and this is repeated until all objects are in a single cluster. Finally, the optimum number of clusters is chosen from among the solutions produced along the way.
- Divisive methods – all objects start in the same cluster, which is then split step by step until each object is in its own cluster. This is the reverse of the agglomerative approach.
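The agglomerative procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the data points are invented, the choice of single linkage (distance between the closest pair of members) is an assumption, and the final pick of two clusters is made by hand rather than by any formal criterion.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def single_linkage(c1, c2, points):
    """Cluster-to-cluster distance: the distance of the closest pair."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)


def agglomerate(points):
    """Merge the two closest clusters repeatedly until one remains.

    Returns every partition seen along the way, from n singleton
    clusters down to a single cluster containing all objects.
    """
    clusters = [[i] for i in range(len(points))]
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        a, b = min(
            pairs,
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]], points),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        history.append([sorted(c) for c in clusters])
    return history


# Hypothetical 2-D observations, identified by their index 0..4.
points = [(1.0, 1.0), (1.5, 1.2), (8.0, 8.0), (8.3, 7.6), (4.5, 4.4)]
history = agglomerate(points)
for partition in history:
    print(len(partition), "clusters:", partition)

# Choosing the number of clusters is a separate decision; here the
# two-cluster solution is simply picked out by hand.
two_clusters = next(p for p in history if len(p) == 2)
```

A divisive method would walk the same hierarchy in the opposite direction, starting from the single all-inclusive cluster and splitting it. This sketch recomputes every linkage distance at each step, which is fine for a handful of points but slow for large data sets.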
Non-hierarchical Cluster Analysis methods (of which k-means clustering is the best-known example)
These are generally used when large data sets are involved. Further, unlike hierarchical methods, in which a merge or split is never undone, these provide the flexibility of moving a subject from one cluster to another as the solution is refined.
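As a rough illustration of how a subject can move between clusters, here is a minimal k-means sketch (Lloyd's algorithm) in plain Python. The data points and the deliberately poor starting centroids are invented for this example, and the function name `k_means` is simply a label chosen here.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def k_means(points, start_centroids, max_iter=100):
    """Alternate assignment and update steps until labels stop changing."""
    centroids = list(start_centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the solution is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave a centroid in place if it has no members
                centroids[c] = mean(members)
    return labels, centroids


# Hypothetical 2-D observations and starting centroids (k = 2).
points = [(1.0, 1.0), (1.5, 1.2), (4.5, 4.4), (8.0, 8.0), (8.3, 7.6)]
labels, centroids = k_means(points, [(1.0, 1.0), (1.5, 1.2)])
print("labels:", labels)
print("centroids:", centroids)
```

With these starting centroids the second point is first assigned to cluster 1 and then moves to cluster 0 once the centroids are updated, which is the reassignment that hierarchical methods do not allow. The result of k-means depends on the starting centroids, so in practice the algorithm is usually run from several different starts.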
Cluster Analysis as a statistical tool has evolved greatly over the years. In its early days, a lot of work went into making the algorithms of Cluster Analysis simpler and less computationally intensive, because earlier generations of computers were not capable of the fast and accurate calculations that are possible today. With this in mind, many computational shortcuts were built into the algorithms. Over time, however, the capacity of computers has improved many times over, and there is a greater need to process Big Data. This has led to the development of pre-clustering methods like ‘Canopy Clustering’, while the growth of high-dimensional data has led to newer subspace and density-based clustering methods. More recently, developments in computer science and statistical physics have led to ‘message passing’ algorithms for Cluster Analysis.
The main benefit of Cluster Analysis is that it allows us to group similar data together, which helps us identify patterns among data elements. It reveals associations between data objects and outlines structure that may not have been apparent beforehand but that gives the data meaning once discovered. Once a clear structure emerges, decision making becomes easier.
Illustration and Practical Applications of Cluster Analysis
To understand how Cluster Analysis is used, let’s take an example. Say you run a retail chain with hundreds of stores across many locations. How do you conduct assortment planning and best manage store performance? Cluster Analysis can group stores by customer demographics, purchase behavior and demand patterns across locations. These groupings help you plan assortments and promotional activities and benchmark stores against comparable stores, for better performance and higher returns.
In the field of marketing, Cluster Analysis is widely used for market segmentation and positioning, and to identify test markets for new product development. In social networking and social media, it is used to identify similar communities within larger groups. Cluster Analysis has also been widely used in biology and medical science, for example in human genetic clustering, grouping sequences into gene families, building groups of genes, and clustering organisms at the species, genus or higher levels.
R is an open-source programming language and environment often used for statistical analysis, and we can use it to conduct Cluster Analysis as well. R has a wide variety of options and approaches available for conducting Cluster Analysis, including hierarchical agglomerative, model-based and partitioning methods.
SPSS is another statistical software package widely used to conduct Cluster Analysis, which can be carried out in SPSS in a simple 8-10 step process. SPSS allows you to choose the variables on which you want to conduct Cluster Analysis and the method you want to use. If you are working with a very small data set, you may use the option to display the distances between all observations in the data set. You also have the option to include or exclude details of the cluster memberships.