A Rapid Review of Clustering Algorithms

Hui Yin, Amir Aryani, Stephen Petrie, Aishwarya Nambissan, Aland Astudillo, Shengyuan Cao

TL;DR

This rapid review addresses the challenge of choosing appropriate clustering methods across diverse data tasks by proposing a five-dimensional classification framework: underlying principles, data-point assignment, dataset capacity, predefined cluster numbers, and application area. It surveys mainstream algorithms within five principle families (Partition, Hierarchical, Density, Grid, Model-based) and contrasts hard versus soft clustering, scalability considerations, and strategies to determine the number of clusters. The paper also synthesizes internal and external evaluation metrics, detailing widely used measures such as Silhouette, Davies-Bouldin, Dunn's index, ARI, and NMI, along with their limitations. It highlights current trends, including deep-learning hybrids and domain-specific adaptations, and underscores the absence of a universal clustering solution, offering practical guidance for task-dependent method selection and identifying open challenges for future work.
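To make the difference between internal and external evaluation concrete, the following is a minimal sketch of one metric of each kind named above: the Silhouette coefficient (internal, needs only the data and the cluster labels) and ARI (external, needs reference labels). It is not taken from the paper. The function names `silhouette_score` and `adjusted_rand_index`, the toy points, and the choice of Euclidean distance are all assumptions made for illustration.

```python
from collections import Counter
from math import comb, dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient (internal metric, no ground truth needed)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # common convention: a singleton cluster contributes 0
        # a: mean Euclidean distance to the other members of p's own cluster
        # (the sum includes dist(p, p) == 0, so divide by len(own) - 1)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance from p to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index (external metric, compares against reference labels)."""
    n = len(labels_true)
    # Count pairs of points that share a cluster in both, or in either, labeling
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = same_true * same_pred / comb(n, 2)  # agreement expected by chance
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (same_both - expected) / (maximum - expected)


# Toy data (hypothetical): two well-separated groups of three points each
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
truth = [0, 0, 0, 1, 1, 1]
good = [1, 1, 1, 0, 0, 0]  # same partition as truth, cluster names swapped
poor = [0, 1, 0, 1, 0, 1]  # partition unrelated to the true grouping

print("silhouette (good):", silhouette_score(points, good))
print("silhouette (poor):", silhouette_score(points, poor))
print("ARI (good):", adjusted_rand_index(truth, good))
print("ARI (poor):", adjusted_rand_index(truth, poor))
```

ARI depends only on which points are grouped together, so swapping the cluster names in `good` does not lower its score. The silhouette value, by contrast, depends on the chosen distance and tends to favor compact, well-separated clusters, which is one of the limitations to keep in mind for internal metrics.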

Abstract

Clustering algorithms aim to organize data into groups, or clusters, based on the inherent patterns and similarities within the data. They play an important role in areas such as marketing and e-commerce, healthcare, data organization and analysis, and social media. Numerous clustering algorithms exist, and new ones continue to be developed. Each algorithm has its own strengths and weaknesses, and there is currently no universally applicable algorithm for all tasks. In this work, we analyze existing clustering algorithms and classify mainstream algorithms across five dimensions: underlying principles and characteristics, data point assignment to clusters, dataset capacity, predefined cluster numbers, and application area. This classification helps researchers understand clustering algorithms from various perspectives and identify algorithms suited to specific tasks. Finally, we discuss current trends and potential future directions in clustering algorithms, and we identify open challenges and unresolved issues in the field.

Paper Structure (17 sections, 5 equations, 2 figures, 3 tables)

This paper contains 17 sections, 5 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Structure of the clustering algorithm classification, covering five dimensions.
  • Figure 7: Example of K-Means clustering with two clusters, illustrating two different types of distance between data points.