
Comparative Analysis of Optimization Strategies for K-means Clustering in Big Data Contexts: A Review

Ravil Mussabayev, Rustam Mussabayev

TL;DR

This paper surveys optimization strategies for K-means-like clustering in big data, formalizing the MSSC objective as $f(C,X)=\sum_{i=1}^m \min_{j=1,\dots,k} \|x_i-c_j\|^2$ and evaluating them through the Less-is-More (LIMA) dominance lens. It categorizes approaches into data decomposition, parallelization, memory efficiency, canopy pre-clustering, triangle-inequality pruning, sampling-based initialization, approximation, and hybrid methods, providing a comparative, experiment-backed guide for practitioners. Key findings show that the Big-means algorithm often achieves the best balance of accuracy, speed, and simplicity under LIMA, while fast but less accurate methods like Minibatch K-means may fail to deliver robust quality on diverse big data tasks. The study offers actionable recommendations, including an algorithm selection flowchart, and highlights directions for future research such as integrating modern metaheuristics and automated parameter tuning to further advance true big data MSSC clustering.
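
To make the MSSC objective above concrete, the following is a minimal pure-Python sketch that evaluates $f(C,X)$ for a given set of centroids. The function names `squared_distance` and `mssc_objective` and the toy data are illustrative assumptions, not code from the paper; a real big data implementation would vectorize or chunk this computation.

```python
from typing import Sequence

Point = Sequence[float]


def squared_distance(x: Point, c: Point) -> float:
    """Squared Euclidean distance ||x - c||^2 between two points."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def mssc_objective(centroids: Sequence[Point], points: Sequence[Point]) -> float:
    """Evaluate f(C, X): for each point, take the squared distance to its
    nearest centroid, then sum over all points.

    Hypothetical helper for illustration only (not the paper's code).
    """
    return sum(
        min(squared_distance(x, c) for c in centroids)
        for x in points
    )


# Toy data (assumed for illustration): two well-separated pairs of points.
X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
C = [(0.0, 1.0), (10.0, 1.0)]

# Every point lies at squared distance 1 from its nearest centroid.
print(mssc_objective(C, X))  # 4.0
```

The cost of this direct evaluation grows with the product of the number of points, the number of centroids, and the dimensionality, which is the cost that the surveyed strategies (sampling, pruning, decomposition, and so on) aim to reduce.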

Abstract

This paper presents a comparative analysis of different optimization techniques for the K-means algorithm in the context of big data. K-means is a widely used clustering algorithm, but it can suffer from scalability issues when dealing with large datasets. The paper explores different approaches to overcome these issues, including parallelization, approximation, and sampling methods. The authors evaluate the performance of various clustering techniques on a large number of benchmark datasets, comparing them according to the dominance criterion provided by the "less is more" approach (LIMA), i.e., simultaneously along the dimensions of speed, clustering quality, and simplicity. The results show that different techniques are more suitable for different types of datasets and provide insights into the trade-offs between speed and accuracy in K-means clustering for big data. Overall, the paper offers a comprehensive guide for practitioners and researchers on how to optimize K-means for big data applications.

Paper Structure (51 sections, 3 equations, 8 figures, 26 tables, 11 algorithms)

Figures (8)

  • Figure 1: Trends in big data publications and citations over the years alongside common data types used in big data clustering.
  • Figure 2: Ontological graph of the problem area and its main technologies.
  • Figure 3: Histogram of most commonly used approaches and technologies in the field of big data clustering.
  • Figure 4: Timeline of milestones in K-means clustering optimization and other MSSC algorithms.
  • Figure 5: Flowchart of big data clustering algorithm selection.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Definition 6.1
  • Definition 6.2