Data clustering: a fundamental method in data science and management
Tai Dinh, Wong Hauchi, Daniil Lisik, Michal Koren, Dat Tran, Philip S. Yu, Joaquín Torres-Sospedra
TL;DR
The paper surveys data clustering as a foundational unsupervised technique in data science and delineates its role within knowledge discovery in databases (KDD) and the broader data science workflow. It formalizes clustering objectives and provides a taxonomy of methodologies, validation metrics, and a practical clustering workflow. Key contributions include a comprehensive survey of algorithms (partitional, hierarchical, density-based, model-based, subspace, graph-based, data-stream, and ensemble approaches), an overview of libraries and tools, and a discussion of the challenges of practical deployment. The discussion highlights clustering's potential to drive data-driven decision-making and suggests future directions, including integration with large language models and scalable, adaptive clustering of evolving data.
Abstract
This paper explores the critical role of data clustering in data science, emphasizing its methodologies, tools, and diverse applications. Traditional techniques, such as partitional and hierarchical clustering, are analyzed alongside advanced approaches such as data stream, density-based, graph-based, and model-based clustering for handling complex structured datasets. The paper highlights key principles underpinning clustering, outlines widely used tools and frameworks, introduces the workflow of clustering in data science, discusses challenges in practical implementation, and examines various applications of clustering. By focusing on these foundations and applications, the discussion underscores clustering's transformative potential. The paper concludes with insights into future research directions, emphasizing clustering's role in driving innovation and enabling data-driven decision-making.
