Faster and Simpler Greedy Algorithm for $k$-Median and $k$-Means

Max Dupré la Tour; David Saulpic

TL;DR

This paper tackles fast approximations for $k$-means, $k$-median, and more generally $(k,z)$-clustering by refining a recursive greedy framework originally due to Mettu and Plaxton. It introduces a simplification that replaces the original ball-value with $\mathrm{Value}(B(x,r)) \approx r^z \,|B(x,r)|$, and supports approximate ball neighborhoods via $N(x,r)$, enabling near-linear or almost-linear time implementations in Euclidean spaces and sparse graphs. The main contributions are a $\mathrm{poly}(c)$-approximation with explicit running-time bounds, plus practical Euclidean and graph-specific instantiations: near-linear time via quadtrees, constant-factor via LSH, and near-linear ball-counting in graphs using probabilistic partitions and Cohen-style sketches. These results yield scalable, incremental/online seeding procedures that maintain strong guarantees for $(k,z)$-clustering, including an online variant where prefixes provide good approximations. The work thus advances practical greedy approaches for core clustering objectives while clarifying their algorithmic structure and runtime trade-offs.
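
To make the simplified ball-value concrete, here is a minimal brute-force sketch in Python. It assumes points in Euclidean space and exact balls; the function names `ball_value` and `clustering_cost` are hypothetical and chosen only for illustration. It is not the paper's implementation, which relies on quadtrees, LSH, or sketches to avoid scanning every point.

```python
import math

def ball_value(points, x, r, z):
    """Simplified ball value r^z * |B(x, r)|, computed by scanning all points.

    Assumption: B(x, r) is the closed ball of radius r around x.
    """
    count = sum(1 for p in points if math.dist(p, x) <= r)
    return (r ** z) * count

def clustering_cost(points, centers, z):
    """Standard (k,z)-clustering cost: sum of dist(p, nearest center)^z.

    z = 1 gives the k-median objective, z = 2 the k-means objective.
    """
    return sum(min(math.dist(p, c) for c in centers) ** z for p in points)

# Hypothetical toy data: three nearby points and one far outlier.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
```

For example, `ball_value(points, (0.0, 0.0), 1.0, 2)` counts the three points within distance 1 of the origin and multiplies by $1^2$, while `clustering_cost(points, [(0.0, 0.0), (10.0, 10.0)], 2)` evaluates the $k$-means objective for two candidate centers.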

Abstract

Clustering problems such as $k$-means and $k$-median are staples of unsupervised learning, and many algorithmic techniques have been developed to tackle their numerous aspects. In this paper, we focus on the class of greedy approximation algorithms, which has attracted less attention than its local-search or primal-dual counterparts. In particular, we study the recursive greedy algorithm developed by Mettu and Plaxton [SIAM J. Comp 2003]. We provide a simplification of the algorithm that allows for faster implementations in graph metrics and in Euclidean space, where our algorithm matches or improves the state of the art.

Paper Structure (24 sections, 23 theorems, 37 equations, 3 algorithms)

Introduction
Our contribution
Further related work
Preliminaries
The Greedy Algorithm
Description of the original algorithm
Simplification and extension
Analysis
Running time analysis of the Center Selection loop
Implementation in the Euclidean setting
Near-linear time approximation via multiple quadtrees
Constant-factor approximation via Locality-sensitive hashing:
Almost-linear time implementation for Graphs
Proof of Correctness
The easy case, dealing with $\Gamma_0$:
...and 9 more sections

Key Result

Theorem 1.1

Let $(P, \text{dist})$ be a metric space with aspect-ratio $\Delta$ (the ratio between the largest distance and the smallest non-zero distance in the metric), and let $c > 5$ be a constant. Suppose there is: … Then the recursive greedy algorithm can be implemented such that it is a $\mathrm{poly}(c)$-approximation and has running time $T_{\mathrm{Value}}$…
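
The aspect-ratio $\Delta$ used in the theorem can be illustrated with a short Python sketch. It assumes a finite point set in Euclidean space and computes all pairwise distances by brute force; the function name `aspect_ratio` is hypothetical, and the quadratic scan is for illustration only.

```python
import math
from itertools import combinations

def aspect_ratio(points):
    """Ratio between the largest distance and the smallest non-zero distance.

    Assumption: `points` holds at least two distinct points, so that a
    non-zero pairwise distance exists.
    """
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    nonzero = [d for d in dists if d > 0]
    return max(nonzero) / min(nonzero)

# Hypothetical toy data: pairwise distances are 1, 3 and 4.
line_points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
```

On `line_points`, the largest distance is 4 and the smallest non-zero distance is 1, so `aspect_ratio(line_points)` evaluates to 4.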

Theorems & Definitions (40)

Theorem 1.1: see \ref{thm:correctness} and \ref{thm:runningtime}
Corollary 1.2
Lemma 2.1
proof
Theorem 3.1: MPOnlineMedian
Theorem 3.2
Theorem 3.3
proof
Lemma 3.5
proof
...and 30 more
