Local Search for Clustering in Almost-linear Time

Shaofeng H. -C. Jiang, Yaonan Jin, Jianing Lou, Pinyan Lu

TL;DR

This work introduces a novel local-search framework for clustering that achieves an $O(1)$-approximation in almost-linear time, notably for Euclidean $k$-Means, where the algorithm runs in $\tilde{O}(d\,n^{1+1/c})$ time for any constant $c\ge1$. Central to the approach is a 1-swap local search with a super-effective swap selection rule that ties objective improvement to clustering recourse, enabling near-linear total recomputation via a carefully designed clustering oracle and a dynamic approximate nearest neighbor structure. The authors extend the framework beyond Euclidean spaces by reducing clustering on graphs to shortest-path metrics and then to sparse metric spanners, yielding $O(c)$-approximation results in $\tilde{O}(n^{1+3\rho})$ time for spaces with LSH-based or doubling-spanner constructions. The paper also provides a general analysis of local search on graphs, including explicit constants for $k$-Means and $k$-Median, and demonstrates applicability to a broad class of metric spaces (Euclidean, $\ell_p$, Jaccard, doubling) via spanners, with a unified framework that balances accuracy and near-linear efficiency.

Abstract

We propose the first local search algorithm for Euclidean clustering that attains an $O(1)$-approximation in almost-linear time. Specifically, for Euclidean $k$-Means, our algorithm achieves an $O(c)$-approximation in $\tilde{O}(n^{1 + 1 / c})$ time, for any constant $c \ge 1$, maintaining the same running time as the previous (non-local-search-based) approach [la Tour and Saulpic, arXiv'2407.11217] while improving the approximation factor from $O(c^{6})$ to $O(c)$. The algorithm generalizes to any metric space with sparse spanners, delivering efficient constant approximation in $\ell_p$ metrics, doubling metrics, Jaccard metrics, etc. This generality derives from our main technical contribution: a local search algorithm on general graphs that obtains an $O(1)$-approximation in almost-linear time. We establish this through a new $1$-swap local search framework featuring a novel swap selection rule. At a high level, this rule "scores" every possible swap, based on both its modification to the clustering and its improvement to the clustering objective, and then selects those high-scoring swaps. To implement this, we design a new data structure for maintaining approximate nearest neighbors with amortized guarantees tailored to our framework.

Paper Structure

This paper contains 36 sections, 34 theorems, 110 equations, 3 figures, 1 table.

Key Result

Theorem 1.1

For any constant $c \ge 1$, there is an algorithm that computes an $O(c)$-approximation for Euclidean $k$-Means on a given $n$-point dataset with aspect ratio $\Delta > 0$, ...

Footnote: The aspect ratio of a dataset is defined as the ratio between the maximum and minimum pairwise distances among the data points.
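The aspect-ratio definition above can be made concrete with a small sketch. It assumes Euclidean distance and at least two distinct points; the function name `aspect_ratio` is illustrative and does not come from the paper.

```python
import math
from itertools import combinations


def aspect_ratio(points):
    """Max pairwise distance divided by min pairwise distance.

    Assumes at least two points, all distinct (so the minimum is nonzero).
    """
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    return max(dists) / min(dists)


if __name__ == "__main__":
    # Pairwise distances are 1, 3, and 4, so the ratio is 4 / 1.
    print(aspect_ratio([(0, 0), (1, 0), (4, 0)]))
```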

Figures (3)

  • Figure 1: The operations initialize and initialize-subclusterings.
  • Figure 2: The (sub)operations insert and insert-subclustering.
  • Figure 3: The operation sample-noncenter.

Theorems & Definitions (94)

  • Theorem 1.1: Euclidean $k$-Means; see \ref{cor:lsh}
  • Theorem 1.2: $k$-Means on graphs; see \ref{cor:general-graph}
  • Claim 1.3: Informal; see \ref{lem:sample}
  • Definition 2.1: Hop-boundedness
  • Proposition 2.2: Generalized triangle inequalities [MMR19]
  • Proposition 2.3: Coarse approximation
  • Proposition 2.5: Removing \ref{assumption:edge-weight}
  • Proposition 3.2: Bounded distances
  • Definition 3.3: Isolation set cover
  • Lemma 3.4: Isolation set cover
  • ...and 84 more