Comparative Analysis of Optimization Strategies for K-means Clustering in Big Data Contexts: A Review

Ravil Mussabayev; Rustam Mussabayev

TL;DR

This paper surveys optimization strategies for K-means-like clustering in big data. It formalizes the minimum sum-of-squares clustering (MSSC) objective as $f(C,X)=\sum_{i=1}^{m} \min_{j=1,\dots,k} \|x_i-c_j\|^2$, where $X=\{x_1,\dots,x_m\}$ is the set of data points and $C=\{c_1,\dots,c_k\}$ is the set of centroids, and evaluates the strategies through the Less-is-More (LIMA) dominance lens. Approaches are grouped into data decomposition, parallelization, memory efficiency, canopy pre-clustering, triangle-inequality pruning, sampling-based initialization, approximation, and hybrid methods, yielding a comparative, experiment-backed guide for practitioners. The key finding is that the Big-means algorithm often achieves the best balance of accuracy, speed, and simplicity under LIMA, whereas fast but less accurate methods such as Minibatch K-means may fail to deliver robust quality across diverse big data tasks. The study offers actionable recommendations, including an algorithm selection flowchart, and points to future research directions such as integrating modern metaheuristics and automated parameter tuning to further advance MSSC on true big data.
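To make the MSSC objective in the TL;DR concrete, here is a minimal pure-Python sketch that evaluates $f(C,X)$ directly from its definition. The function names and the toy data are illustrative assumptions, not taken from the paper, and the brute-force loop is meant for reading rather than for large datasets.

```python
from typing import Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mssc_objective(points: Sequence[Point], centroids: Sequence[Point]) -> float:
    """f(C, X): sum over all points of the squared distance to the nearest centroid."""
    return sum(min(squared_distance(x, c) for c in centroids) for x in points)


# Toy data (hypothetical): two well-separated pairs of points.
X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
C = [(0.0, 1.0), (10.0, 1.0)]

# Every point lies at squared distance 1 from its nearest centroid.
print(mssc_objective(X, C))  # 4.0
```

A lower value means the centroids fit the data more tightly. Evaluating the objective this way costs one distance computation per point-centroid pair, which is the cost that strategies such as sampling and triangle-inequality pruning aim to reduce.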

Abstract

This paper presents a comparative analysis of different optimization techniques for the K-means algorithm in the context of big data. K-means is a widely used clustering algorithm, but it can suffer from scalability issues when dealing with large datasets. The paper explores different approaches to overcome these issues, including parallelization, approximation, and sampling methods. The authors evaluate the performance of various clustering techniques on a large number of benchmark datasets, comparing them according to the dominance criterion provided by the "less is more" approach (LIMA), i.e., simultaneously along the dimensions of speed, clustering quality, and simplicity. The results show that different techniques are more suitable for different types of datasets and provide insights into the trade-offs between speed and accuracy in K-means clustering for big data. Overall, the paper offers a comprehensive guide for practitioners and researchers on how to optimize K-means for big data applications.
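The abstract compares methods simultaneously on speed, clustering quality, and simplicity. One possible reading of such a dominance criterion is a Pareto-style check, sketched below. This is an assumption for illustration only: the abstract does not give the exact LIMA definition, and the `Score` fields, the complexity proxy, the method names, and all numbers are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Score(NamedTuple):
    runtime_s: float   # wall-clock time, lower is better
    objective: float   # MSSC objective value, lower is better
    complexity: float  # hypothetical simplicity proxy (e.g. number of tunable parameters), lower is better


def dominates(a: Score, b: Score) -> bool:
    """True if `a` is no worse than `b` on every criterion and strictly better on at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(scores: Dict[str, Score]) -> List[str]:
    """Names of methods that no other method dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]


# Made-up scores for three placeholder methods.
scores = {
    "method_a": Score(runtime_s=10.0, objective=100.0, complexity=2.0),
    "method_b": Score(runtime_s=12.0, objective=105.0, complexity=3.0),
    "method_c": Score(runtime_s=5.0, objective=120.0, complexity=2.0),
}

print(non_dominated(scores))  # ['method_a', 'method_c']
```

Under this reading, `method_b` is dropped because `method_a` beats it on all three criteria, while `method_a` and `method_c` both survive because each is better on a different criterion. This mirrors the speed-versus-accuracy trade-off the abstract describes.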

Paper Structure

This paper contains 51 sections, 3 equations, 8 figures, 26 tables, and 11 algorithms.

Introduction
Problem Challenges
K-means Algorithm
K-means Optimization Approaches
Data decomposition
Parallelization and distributed computing
Memory-efficient algorithms
Canopy Clustering
Triangle inequality
Sampling-based initialization
Approximation techniques
Hybrid approaches
Summary of clustering techniques
Generalizations, Reflections, and Practical Advices
The imperative of scalability
...and 36 more sections

Figures (8)

Figure 1: Trends in big data publications and citations over the years alongside common data types used in big data clustering.
Figure 2: Ontological graph of the problem area and its main technologies
Figure 3: Histogram of most commonly used approaches and technologies in the field of big data clustering
Figure 4: Timeline of milestones in K-means clustering optimization and other MSSC algorithms
Figure 5: Flowchart of big data clustering algorithm selection
...and 3 more figures

Theorems & Definitions (2)

Definition 6.1
Definition 6.2
