Local Search for Clustering in Almost-linear Time
Shaofeng H.-C. Jiang, Yaonan Jin, Jianing Lou, Pinyan Lu
TL;DR
This work introduces a local-search framework for clustering that achieves a constant-factor approximation in almost-linear time. For Euclidean $k$-Means, the algorithm attains an $O(c)$-approximation in $\tilde{O}(d\,n^{1+1/c})$ time for any constant $c \ge 1$. Central to the approach is a 1-swap local search with a super-effective swap selection rule that ties objective improvement to clustering recourse, so that the total recomputation stays near-linear with the help of a carefully designed clustering oracle and a dynamic approximate nearest neighbor structure. The authors extend the framework beyond Euclidean spaces by building a sparse spanner of the input metric and running the graph algorithm on the spanner's shortest-path metric, yielding $O(c)$-approximation results in $\tilde{O}(n^{1+3\rho})$ time for spaces with LSH-based or doubling-spanner constructions. The paper also provides a general analysis of local search on graphs, including explicit constants for $k$-Means and $k$-Median, and shows that a single framework covers a broad class of metric spaces (Euclidean, $\ell_p$, Jaccard, doubling) via spanners.
Abstract
We propose the first \emph{local search} algorithm for Euclidean clustering that attains an $O(1)$-approximation in almost-linear time. Specifically, for Euclidean $k$-Means, our algorithm achieves an $O(c)$-approximation in $\tilde{O}(n^{1 + 1 / c})$ time, for any constant $c \ge 1$, maintaining the same running time as the previous (non-local-search-based) approach [la Tour and Saulpic, arXiv'2407.11217] while improving the approximation factor from $O(c^{6})$ to $O(c)$. The algorithm generalizes to any metric space with sparse spanners, delivering efficient constant approximation in $\ell_p$ metrics, doubling metrics, Jaccard metrics, etc. This generality derives from our main technical contribution: a local search algorithm on general graphs that obtains an $O(1)$-approximation in almost-linear time. We establish this through a new $1$-swap local search framework featuring a novel swap selection rule. At a high level, this rule ``scores'' every possible swap, based on both its modification to the clustering and its improvement to the clustering objective, and then selects those high-scoring swaps. To implement this, we design a new data structure for maintaining approximate nearest neighbors with amortized guarantees tailored to our framework.
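To make the swap-scoring idea concrete, the following is a minimal Python sketch of a 1-swap local search on a small point set. It is not the paper's algorithm: the score used here (objective gain divided by the number of points whose nearest center changes) is an assumed stand-in for the rule described above, centers are restricted to input points, and every candidate swap is evaluated by brute force, so the running time is far from almost-linear. The names `sq_dist`, `assign`, and `local_search` are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Map each point to its nearest center (a point index) and return
    the assignment together with the k-means cost."""
    nearest, cost = [], 0.0
    for p in points:
        d, c = min((sq_dist(p, points[c]), c) for c in centers)
        nearest.append(c)
        cost += d
    return nearest, cost


def local_search(points, k, min_gain=1e-9, seed=0):
    """Brute-force 1-swap local search with an illustrative scoring rule."""
    rng = random.Random(seed)
    centers = rng.sample(range(len(points)), k)
    nearest, cost = assign(points, centers)
    while True:
        best = None  # (score, centers, nearest, cost) of the best swap so far
        for out in centers:
            for inn in range(len(points)):
                if inn in centers:
                    continue
                cand = [inn if c == out else c for c in centers]
                cand_nearest, cand_cost = assign(points, cand)
                gain = cost - cand_cost
                if gain <= min_gain:
                    continue  # only strictly improving swaps are considered
                # Recourse: how many points change their assigned center.
                recourse = sum(a != b for a, b in zip(nearest, cand_nearest))
                # Assumed score: improvement per unit of recourse.
                score = gain / max(recourse, 1)
                if best is None or score > best[0]:
                    best = (score, cand, cand_nearest, cand_cost)
        if best is None:
            return centers, cost  # no improving swap remains: local optimum
        _, centers, nearest, cost = best
```

Calling `local_search(points, k=2)` on two well-separated groups of points returns one center per group. Each round of this sketch rescans all points for every candidate swap. According to the abstract, the paper avoids that kind of cost with a dedicated data structure for maintaining approximate nearest neighbors.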
