Faster Approximation Algorithms for k-Center via Data Reduction

Arnold Filtser, Shaofeng H.-C. Jiang, Yi Li, Anurag Murty Naredla, Ioannis Psarros, Qiaoyuan Yang, Qin Zhang

TL;DR

This work addresses the Euclidean $k$-Center problem in the large-$k$ regime using $\alpha$-coresets: small subsets $S$ of the dataset $P$ such that any $\beta$-approximation on $S$ is an $(\alpha+\beta)$-approximation on $P$. It delivers two coreset constructions: a near-linear-time construction based on efficient consistent hashing and dimension reduction, which yields a near-linear-time $O(1)$-approximation when $k=n^c$ (with $0<c<1$), and a sampling-based method using approximate nearest neighbor (ANN) search that provides a second, scalable option. A key technical contribution is a new consistent hashing scheme with parameters $\Gamma=\beta$ and $\Lambda=\mathrm{poly}(d)\exp(O(d/\beta^{2/3}))$, enabling compact coverings in high dimensions. Empirical results show that running the Gonzalez algorithm on these coresets is up to 4 times faster than running it on the full dataset while achieving similar clustering cost, demonstrating practical viability for large-scale high-dimensional clustering.

Abstract

We study efficient algorithms for the Euclidean $k$-Center problem, focusing on the regime of large $k$. We take the approach of data reduction by considering an $\alpha$-coreset, which is a small subset $S$ of the dataset $P$ such that any $\beta$-approximation on $S$ is an $(\alpha + \beta)$-approximation on $P$. We give efficient algorithms to construct coresets whose size is $k \cdot o(n)$, which immediately speeds up existing approximation algorithms. Notably, we obtain a near-linear time $O(1)$-approximation when $k = n^c$ for any $0 < c < 1$. We validate the performance of our coresets on real-world datasets with large $k$, and we observe that the coreset speeds up the well-known Gonzalez algorithm by up to $4$ times, while still achieving similar clustering cost. Technically, one of our coreset results is based on a new efficient construction of consistent hashing with competitive parameters. This general tool may be of independent interest for algorithm design in high-dimensional Euclidean spaces.
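
To make the objects in the abstract concrete, here is a minimal Python sketch of the Gonzalez algorithm (farthest-first traversal, the classical 2-approximation for $k$-Center, running in $O(nk)$ distance computations) together with the $k$-Center cost. The function names `gonzalez` and `kcenter_cost` are illustrative, and the subset `S` below is a plain uniform sample used only as a placeholder: it is not one of the paper's coreset constructions and carries no $\alpha$-coreset guarantee.

```python
import math
import random

def kcenter_cost(points, centers):
    """k-Center cost: the largest distance from any point to its nearest center."""
    return max(min(math.dist(p, c) for c in centers) for p in points)

def gonzalez(points, k, first=0):
    """Farthest-first traversal starting from points[first]; returns up to k centers."""
    centers = [points[first]]
    # dist[i] holds the distance from points[i] to its nearest chosen center.
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        # Pick the point that is currently farthest from all chosen centers.
        far = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[far])
        for i, p in enumerate(points):
            d = math.dist(p, points[far])
            if d < dist[i]:
                dist[i] = d
    return centers

random.seed(0)
P = [(random.random(), random.random()) for _ in range(2000)]
k = 20
S = random.sample(P, 400)  # placeholder subset, NOT the paper's coreset

cost_full = kcenter_cost(P, gonzalez(P, k))  # centers computed on all of P
cost_sub = kcenter_cost(P, gonzalez(S, k))   # centers computed on S, cost measured on P
print(cost_full, cost_sub)
```

The comparison mirrors how a coreset is used: centers are computed on the small set but their cost is always evaluated on the full dataset $P$. Since each run costs $O(|S| \cdot k)$ distance computations instead of $O(nk)$, a smaller $S$ directly reduces running time.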

Paper Structure

This paper contains 12 sections, 15 theorems, 14 equations, 2 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1.1

For every $\alpha \geq 1$, there exists an $O(\alpha)$-coreset of size $\tilde{O}(k n^{1 / \alpha^{2/3}})$ that can be computed in time $\tilde{O}(n)$ with probability at least $0.99$.
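
As a rough reading of this size bound (a back-of-the-envelope calculation, assuming $k = n^c$ with $0 < c < 1$ and ignoring the polylogarithmic factors hidden by $\tilde{O}$), the coreset size becomes sublinear in $n$ once $\alpha$ exceeds a constant that depends only on $c$:

$$
k \, n^{1/\alpha^{2/3}} = n^{\,c + \alpha^{-2/3}},
\qquad
c + \alpha^{-2/3} < 1
\iff \alpha^{2/3} > \frac{1}{1-c}
\iff \alpha > (1-c)^{-3/2}.
$$

For example, with $c = 1/2$ the threshold is $2^{3/2} \approx 2.83$, so any constant $\alpha$ above it gives a coreset of size $n^{1-\Omega(1)}$ up to polylogarithmic factors. The approximation guarantee is stated as $O(\alpha)$, so the constant hidden there is not captured by this calculation.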

Figures (2)

  • Figure 1: The trade-off between the coreset size and the $k$-Center cost for all baselines in each dataset.
  • Figure 2: The trade-off between the coreset size and the running time for all baselines in each dataset.

Theorems & Definitions (32)

  • Theorem 1.1
  • Theorem 1.2
  • Definition 2.1: Covering
  • Lemma 2.1: Coarse approximation
  • Definition 3.1
  • Lemma 3.2
  • proof
  • Lemma 3.3: aiger2014reporting, Lemma 3.1
  • Theorem 4.1
  • Lemma 4.1
  • ...and 22 more