The Catastrophic Failure of The k-Means Algorithm in High Dimensions, and How Hartigan's Algorithm Avoids It

Roy R. Lederman, David Silva-Sánchez, Ziling Chen, Gilles Mordant, Amnon Balanov, Tamir Bendory

TL;DR

This work analyzes k-means clustering in high-dimensional, high-noise settings under a two-component Gaussian mixture. It proves that fixed points proliferate catastrophically for Lloyd's algorithm, causing it to stall after the first update, whereas Hartigan's greedy updates avoid such fixed points and recover the correct clustering with high probability. Extensive numerical experiments complement the theory: they map out Lloyd's failure region and show that Hartigan's algorithm performs robustly, often rivaling spectral and SDP methods. The findings clarify when and why Lloyd's method struggles, advocate using Hartigan's algorithm in high dimensions, and carry implications for EM-like methods as well.

Abstract

Lloyd's k-means algorithm is one of the most widely used clustering methods. We prove that in high-dimensional, high-noise settings, the algorithm exhibits catastrophic failure: with high probability, essentially every partition of the data is a fixed point. Consequently, Lloyd's algorithm simply returns its initial partition - even when the underlying clusters are trivially recoverable by other methods. In contrast, we prove that Hartigan's k-means algorithm does not exhibit this pathology. Our results show the stark difference between these algorithms and offer a theoretical explanation for the empirical difficulties often observed with k-means in high dimensions.
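
The contrast between the two update rules can be illustrated with a minimal sketch. This is not the paper's code: the function names (`lloyd_step`, `hartigan_pass`) and the one-dimensional toy data are assumptions made for illustration, and the toy example is not in the high-dimensional regime the paper analyzes. Lloyd's rule reassigns each point to its nearest centroid, where the centroid of the point's own cluster includes the point itself. Hartigan's rule moves a point whenever the move lowers the k-means objective, which for a point $x$ in cluster $a$ (size $n_a$) and a candidate cluster $b$ (size $n_b$) amounts to comparing $\frac{n_a}{n_a-1}\|x-c_a\|^2$ with $\frac{n_b}{n_b+1}\|x-c_b\|^2$.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroids(X, labels, K):
    """Mean of each cluster; assumes every cluster is non-empty."""
    cents = []
    for k in range(K):
        members = [x for x, l in zip(X, labels) if l == k]
        if not members:
            raise ValueError("empty cluster")
        cents.append([sum(col) / len(members) for col in zip(*members)])
    return cents


def lloyd_step(X, labels, K=2):
    """One Lloyd update: recompute centroids, then assign every point
    to its nearest centroid (all points are updated simultaneously)."""
    cents = centroids(X, labels, K)
    return [min(range(K), key=lambda k: sqdist(x, cents[k])) for x in X]


def hartigan_pass(X, labels, K=2):
    """One sweep of Hartigan-style updates: points are visited one at a
    time, and a point moves if doing so strictly lowers the k-means cost."""
    labels = list(labels)
    moved = 0
    for i, x in enumerate(X):
        a = labels[i]
        sizes = [labels.count(k) for k in range(K)]
        if sizes[a] <= 1:
            continue  # never empty a cluster
        cents = centroids(X, labels, K)
        # Cost of keeping x in a, corrected for x's own pull on c_a.
        best, best_cost = a, sizes[a] / (sizes[a] - 1) * sqdist(x, cents[a])
        for b in range(K):
            if b == a:
                continue
            cost = sizes[b] / (sizes[b] + 1) * sqdist(x, cents[b])
            if cost < best_cost:
                best, best_cost = b, cost
        if best != a:
            labels[i] = best
            moved += 1
    return labels, moved


# Toy data (hypothetical): three points on a line, initial partition {0, 4} | {7}.
X = [[0.0], [4.0], [7.0]]
start = [0, 0, 1]
print(lloyd_step(X, start))     # [0, 0, 1]: the partition is a Lloyd fixed point
print(hartigan_pass(X, start))  # ([0, 1, 1], 1): Hartigan moves the point at 4
```

In this toy case the k-means cost drops from 8 to 4.5 after the Hartigan move, even though Lloyd's rule sees nothing to change. The size-dependent factors are what distinguish the two rules; the sketch only shows that a Lloyd fixed point need not be a Hartigan fixed point.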

Paper Structure (58 sections, 25 theorems, 118 equations, 9 figures, 1 table, 2 algorithms)

Key Result

Theorem 1.1

Consider $n \in \mathbb{N}$ observed samples $x_1,\dots,x_n$ from a two-cluster ($K=2$) Gaussian mixture model in $\mathbb{R}^d$, with standard normally distributed means $\mu_1^\star,\mu_2^\star\in\mathbb{R}^d$ and isotropic noise covariance $\sigma^2 I_d$. Let $\mathcal{F}_{\mathrm{Lloyd}}$ denote which yield the contrasting behaviors as $d,n\to\infty$: where $a\lesssim b$ means that there exis
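
For readers who want to experiment, the data model described at the start of the theorem can be sampled as in the following sketch. It relies only on what is stated above (two means drawn from a standard normal distribution in $\mathbb{R}^d$, isotropic noise with covariance $\sigma^2 I_d$). The function name `sample_gmm`, the uniformly random class assignment, and the `seed` argument are assumptions introduced here and may differ from the paper's exact model.

```python
import random


def sample_gmm(n, d, sigma, seed=0):
    """Draw n samples in R^d from a two-component Gaussian mixture.

    Assumptions (not taken from the paper): each sample picks its
    component uniformly at random, and `seed` fixes the generator.
    """
    rng = random.Random(seed)
    # Two cluster means with i.i.d. standard normal coordinates.
    means = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(2)]
    # Latent class of each sample.
    labels = [rng.randrange(2) for _ in range(n)]
    # Sample = mean of its class + isotropic Gaussian noise of scale sigma.
    X = [[means[z][j] + sigma * rng.gauss(0.0, 1.0) for j in range(d)]
         for z in labels]
    return X, labels, means


X, z, mu = sample_gmm(n=40, d=1000, sigma=3.0)
print(len(X), len(X[0]))  # 40 1000
```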

Figures (9)

  • Figure 1: Normalized mutual information (NMI; see Definition \ref{def:preliminaries:nmi}) between the ground-truth partition and the output of each clustering algorithm. Each entry reports the mean over $100$ independent trials: in each trial, we sample data from the Gaussian mixture model (GMM) in Model \ref{model:gmm2} (generalized to $K\geq 2$) with $\tau^2=1.0$ and $20$ samples per class, and run each algorithm until convergence. The results illustrate that in the high-noise, high-dimensional regime, Lloyd's $k$-means performs poorly relative to the other methods. In contrast, Hartigan's algorithm achieves performance comparable to spectral clustering and semidefinite-programming (SDP) based clustering. See Section \ref{sec:results:gmm} for details.
  • Figure 2: Comparison of the $k$-Means Win Rate (see Section \ref{sec:additional_results:loss_metric}) obtained with each algorithm for the synthetic GMM dataset for different values of $K$. Each value corresponds to the average of 100 independent experiments, where, for each instance, we sample data from the Gaussian mixture model defined in Model \ref{model:gmm2} (generalized to $K\geq 2$) with $\tau^2=1$ and 20 samples per class, and run each algorithm until convergence.
  • Figure 3: Normalized Mutual Information (NMI) between ground-truth clusters and the clusters obtained from Lloyd's and Hartigan's $k$-means. An NMI value of 1 indicates perfect agreement, while a value of 0 signifies no mutual information between the two assignments. Each value corresponds to the average of 100 independent experiments, where, for each instance, we draw 40 samples from the GMM defined in Model \ref{model:gmm2} (generalized to $K\geq 2$) with $\tau^2 = 1$ and equally sized clusters. The figure shows that Lloyd's $k$-means becomes more sensitive to the initial centers as the data dimension increases, whereas Hartigan's $k$-means is less sensitive. Details for each initialization strategy are available in Section \ref{sec:results}.
  • Figure 4: $k$-Means Win Rate (see Section \ref{sec:additional_results:loss_metric}) comparison of Lloyd's and Hartigan's $k$-means for different initialization strategies and number of classes, $K$. Each value corresponds to the average of 100 independent experiments where, for each instance, we sample data from the Gaussian mixture model defined in Model \ref{model:gmm2} (generalized to $K\geq 2$) with $\tau^2=1$ and 20 samples per class, and run each algorithm until convergence.
  • Figure 5: Number of iterations performed by Lloyd's $k$-means for different initialization strategies and number of classes, $K$. Each value corresponds to the average of 100 independent experiments, where, for each instance, we sample data from the Gaussian mixture model defined in Model \ref{model:gmm2} (generalized to $K\geq 2$) with $\tau^2=1$ and 20 samples per class, and run Lloyd's $k$-means until convergence. Note that, in the case of random partition initialization, we exclude the scenario in which the initialization itself might constitute a fixed point; consequently, the reported number of iterations is 1 even if the partition remains unchanged after the first iteration.
  • ...and 4 more figures
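
Several of the captions above report NMI between a predicted and a ground-truth partition. The paper's exact definition is not reproduced on this page, so the following sketch assumes one common variant: mutual information normalized by the arithmetic mean of the two label entropies. The function names `entropy` and `nmi`, and the convention of returning 1.0 when both partitions consist of a single cluster, are assumptions made here.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(u, v):
    """Normalized mutual information between two labelings of the same items.

    Assumed normalization: 2 * I(U; V) / (H(U) + H(V)).
    """
    n = len(u)
    joint = Counter(zip(u, v))   # counts of co-occurring label pairs
    pu, pv = Counter(u), Counter(v)
    mi = sum((c / n) * log(c * n / (pu[a] * pv[b]))
             for (a, b), c in joint.items())
    denom = entropy(u) + entropy(v)
    # Convention (assumption): two trivial one-cluster labelings agree perfectly.
    return 1.0 if denom == 0 else 2 * mi / denom


truth = [0, 0, 1, 1]
print(nmi(truth, [1, 1, 0, 0]))  # 1.0: same partition, labels swapped
print(nmi(truth, [0, 1, 0, 1]))  # 0.0: the two partitions are independent
```

Because NMI depends only on the partitions and not on the label names, it is unaffected by the arbitrary cluster indices an algorithm returns.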

Theorems & Definitions (47)

  • Theorem 1.1: Informal: high-noise, high-dimensional, finite-sample behavior of Lloyd vs. Hartigan
  • Definition 2.2: Current assignment, clusters, and partition
  • Definition 2.3: Centroids
  • Definition 2.4: Class proportions, purity, and correctness
  • Definition 2.5: $q$-approximately balanced partitions
  • Lemma 3.0: Distance to the current cluster centroid
  • Lemma 3.0: Distance to the other cluster centroid
  • Lemma 3.0
  • Theorem 3.1: Lloyd's algorithm: single sample
  • Remark 3.2
  • ...and 37 more