A Rapid Review of Clustering Algorithms
Hui Yin, Amir Aryani, Stephen Petrie, Aishwarya Nambissan, Aland Astudillo, Shengyuan Cao
TL;DR
This rapid review addresses the challenge of choosing an appropriate clustering method across diverse data tasks by proposing a five-dimensional classification framework: underlying principles, data point assignment, dataset capacity, predefined cluster numbers, and application area. It surveys mainstream algorithms within five principle families (Partition, Hierarchical, Density, Grid, Model-based) and contrasts hard versus soft clustering, scalability considerations, and strategies for determining the number of clusters. The paper also synthesizes internal and external evaluation metrics, detailing widely used measures such as Silhouette, Davies-Bouldin, Dunn's index, ARI, and NMI, along with their limitations. It highlights current trends, including deep-learning hybrids and domain-specific adaptations, and underscores the absence of a universal clustering solution. It also offers practical guidance for task-dependent method selection and identifies open challenges for future work.
Abstract
Clustering algorithms organize data into groups, or clusters, based on the inherent patterns and similarities within the data. They play an important role in many areas of modern life, including marketing and e-commerce, healthcare, data organization and analysis, and social media. Numerous clustering algorithms exist, and new ones continue to be developed. Each algorithm has its own strengths and weaknesses, and no single algorithm is currently applicable to all tasks. In this work, we analyze existing clustering algorithms and classify mainstream algorithms along five dimensions: underlying principles and characteristics, data point assignment to clusters, dataset capacity, predefined cluster numbers, and application area. This classification helps researchers understand clustering algorithms from multiple perspectives and identify algorithms suited to specific tasks. Finally, we discuss current trends and potential future directions in clustering algorithms, and we identify open challenges and unresolved issues in the field.
