Comparative Analysis of Optimization Strategies for K-means Clustering in Big Data Contexts: A Review
Ravil Mussabayev, Rustam Mussabayev
TL;DR
This paper surveys optimization strategies for K-means-like clustering in big data. It formalizes the minimum sum-of-squares clustering (MSSC) objective as $f(C,X)=\sum_{i=1}^m \min_{j=1,\ldots,k} \|x_i-c_j\|^2$, where $X=\{x_1,\ldots,x_m\}$ are the data points and $C=\{c_1,\ldots,c_k\}$ the centroids, and compares methods under the dominance criterion of the "less is more" approach (LIMA), which weighs speed, clustering quality, and simplicity simultaneously. It categorizes approaches into data decomposition, parallelization, memory efficiency, canopy pre-clustering, triangle-inequality pruning, sampling-based initialization, approximation, and hybrid methods, providing a comparative, experiment-backed guide for practitioners. Key findings show that the Big-means algorithm often achieves the best balance of accuracy, speed, and simplicity under LIMA, while fast but less accurate methods like Minibatch K-means may fail to deliver robust quality on diverse big data tasks. The study offers actionable recommendations, including an algorithm selection flowchart, and highlights directions for future research, such as integrating modern metaheuristics and automated parameter tuning, to further advance MSSC at true big data scale.
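To make the objective $f(C,X)$ above concrete, the following is a minimal sketch in plain Python that evaluates it for a given set of points and centroids. The function name `mssc_objective` and the toy data are illustrative assumptions, not taken from the paper, and the quadratic loop is meant for reading, not for big data use.

```python
def mssc_objective(points, centroids):
    """Sum, over all points, of the squared Euclidean distance
    to the nearest centroid (the MSSC objective f(C, X))."""
    total = 0.0
    for x in points:
        # Squared distance from x to every centroid; keep the smallest.
        total += min(
            sum((xi - ci) ** 2 for xi, ci in zip(x, c))
            for c in centroids
        )
    return total


# Toy data (hypothetical): two well-separated pairs of points.
X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
C = [(0.0, 1.0), (10.0, 1.0)]

print(mssc_objective(X, C))  # 4.0: each point is at distance 1 from its centroid
```

Lower values indicate tighter clusters, so this is the quantity a K-means-like method tries to minimize over the choice of centroids.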
Abstract
This paper presents a comparative analysis of optimization techniques for the K-means algorithm in the context of big data. K-means is a widely used clustering algorithm, but it can suffer from scalability issues when dealing with large datasets. The paper explores several approaches to overcome these issues, including parallelization, approximation, and sampling methods. The authors evaluate the performance of various clustering techniques on a large number of benchmark datasets, comparing them according to the dominance criterion provided by the "less is more" approach (LIMA), i.e., simultaneously along the dimensions of speed, clustering quality, and simplicity. The results show that no single technique suits every type of dataset, and they provide insights into the trade-offs between speed and accuracy in K-means clustering for big data. Overall, the paper offers a comprehensive guide for practitioners and researchers on how to optimize K-means for big data applications.
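One way to read a dominance comparison along speed, quality, and simplicity is as a Pareto-style check: one method dominates another if it is no worse on every dimension and strictly better on at least one. The sketch below illustrates that reading only. The paper's exact LIMA criterion may differ, the `Result` fields (including parameter count as a proxy for simplicity) are assumptions, and the numbers are placeholders rather than measurements.

```python
from typing import NamedTuple


class Result(NamedTuple):
    objective: float  # clustering quality, lower is better (assumed measure)
    seconds: float    # running time, lower is better
    n_params: int     # hypothetical proxy for simplicity, lower is better


def dominates(a: Result, b: Result) -> bool:
    """True if a is no worse than b on every dimension
    and strictly better on at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


# Placeholder scores for three unnamed methods (not experimental results).
method_a = Result(objective=100.0, seconds=2.0, n_params=3)
method_b = Result(objective=120.0, seconds=2.0, n_params=5)
method_c = Result(objective=90.0, seconds=5.0, n_params=3)

print(dominates(method_a, method_b))  # True: a is better or equal everywhere
print(dominates(method_a, method_c))  # False: c has the lower objective
print(dominates(method_c, method_a))  # False: a is faster
```

When neither method dominates the other, as with `method_a` and `method_c`, the comparison exposes a trade-off between speed and accuracy rather than a clear winner.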
