The Catastrophic Failure of The k-Means Algorithm in High Dimensions, and How Hartigan's Algorithm Avoids It
Roy R. Lederman, David Silva-Sánchez, Ziling Chen, Gilles Mordant, Amnon Balanov, Tamir Bendory
TL;DR
This work analyzes k-means clustering in high-dimensional, high-noise settings under a two-component Gaussian mixture. It proves that Lloyd's algorithm suffers a catastrophic proliferation of fixed points: with high probability, essentially every partition is a fixed point, so the algorithm stalls at its initial partition. Hartigan's greedy updates avoid these fixed points and recover the correct clustering with high probability. Extensive numerical experiments complement the theory, mapping Lloyd's failure region and showing that Hartigan's algorithm performs robustly, often rivaling spectral and SDP methods. The findings clarify when and why Lloyd's method struggles, advocate using Hartigan's algorithm in high dimensions, and carry implications for EM-like methods.
Abstract
Lloyd's k-means algorithm is one of the most widely used clustering methods. We prove that in high-dimensional, high-noise settings, the algorithm exhibits catastrophic failure: with high probability, essentially every partition of the data is a fixed point. Consequently, Lloyd's algorithm simply returns its initial partition, even when the underlying clusters are trivially recoverable by other methods. In contrast, we prove that Hartigan's k-means algorithm does not exhibit this pathology. Our results show the stark difference between these algorithms and offer a theoretical explanation for the empirical difficulties often observed with k-means in high dimensions.
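The contrast described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code: the function names, the sample size `n`, dimension `d`, signal norm `mu_norm`, the seed, and the balanced random initial partition are all assumptions chosen for illustration. Lloyd's step reassigns each point to its nearest centroid, while the Hartigan step uses the standard single-point move rule, which moves a point from cluster a to cluster b only when n_b/(n_b+1) * ||x - c_b||^2 < n_a/(n_a-1) * ||x - c_a||^2, that is, only when the move strictly lowers the k-means cost.

```python
import numpy as np


def kmeans_cost(X, labels, k=2):
    """Sum of squared distances from each point to its cluster centroid."""
    return float(sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                     for c in range(k)))


def lloyd(X, labels, k=2, max_iter=100):
    """Lloyd's algorithm started from a given partition (labels in 0..k-1)."""
    labels = labels.copy()
    for _ in range(max_iter):
        # Centroids of the current partition (sketch assumes no empty cluster).
        centroids = np.stack([X[labels == c].mean(axis=0) for c in range(k)])
        # Squared distances from every point to every centroid, shape (n, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # fixed point: no point prefers the other centroid
        labels = new_labels
    return labels


def hartigan(X, labels, k=2, max_sweeps=100):
    """Hartigan-style greedy single-point moves from a given partition."""
    labels = labels.copy()
    sums = np.stack([X[labels == c].sum(axis=0) for c in range(k)])
    counts = np.bincount(labels, minlength=k).astype(float)
    for _ in range(max_sweeps):
        moved = False
        for i in range(X.shape[0]):
            a = labels[i]
            if counts[a] <= 1:
                continue  # never empty a cluster
            d2 = ((X[i] - sums / counts[:, None]) ** 2).sum(axis=1)
            # Size-corrected costs: joining b costs n_b/(n_b+1)*d2[b],
            # leaving a saves n_a/(n_a-1)*d2[a].
            cost = counts / (counts + 1.0) * d2
            cost[a] = counts[a] / (counts[a] - 1.0) * d2[a]
            b = int(cost.argmin())
            if b != a and cost[b] < cost[a]:
                sums[a] -= X[i]
                sums[b] += X[i]
                counts[a] -= 1
                counts[b] += 1
                labels[i] = b
                moved = True
        if not moved:
            break
    return labels


def accuracy(pred, truth):
    """Fraction of correctly clustered points, up to swapping the two labels."""
    agree = float(np.mean(pred == truth))
    return max(agree, 1.0 - agree)


def demo(n=40, d=20000, mu_norm=20.0, seed=0):
    """Two-component mixture x_i = +/- mu + standard Gaussian noise (assumed setup)."""
    rng = np.random.default_rng(seed)
    truth = np.repeat([0, 1], n // 2)
    signs = np.where(truth == 0, 1.0, -1.0)
    mu = np.full(d, mu_norm / np.sqrt(d))  # ||mu|| = mu_norm, far below the noise norm sqrt(d)
    X = signs[:, None] * mu + rng.standard_normal((n, d))
    init = rng.permutation(n) % 2  # balanced random initial partition
    l_out, h_out = lloyd(X, init), hartigan(X, init)
    return {
        "lloyd_unchanged": bool(np.array_equal(l_out, init)),
        "lloyd_acc": accuracy(l_out, truth),
        "hartigan_acc": accuracy(h_out, truth),
        "lloyd_cost": kmeans_cost(X, l_out),
        "hartigan_cost": kmeans_cost(X, h_out),
    }


if __name__ == "__main__":
    print(demo())
```

Both routines start from the same initial partition, so any difference in the output comes from the update rule alone. The only difference between the two rules is the size-dependent factors n_b/(n_b+1) and n_a/(n_a-1); in this sketch they offset the advantage a point gives its own centroid by being included in it, an advantage that grows with the dimension when noise dominates.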
