Stable coresets: Unleashing the power of uniform sampling

Amir Carmel, Robert Krauthgamer

TL;DR

This work introduces stable coresets, a middle ground between weak and strong coresets, and proves that uniform sampling yields stable coresets for the 1-median under the $\ell_1$ metric. The main result shows that a uniform sample of size $O(\epsilon^{-2}\log d)$ is a stable $(\epsilon/6,4\epsilon)$-coreset with probability at least $4/5$. The framework extends to metrics that embed into $\ell_1$, such as Kendall-tau and Jaccard, and enables $k$-median approximations in these spaces. The authors develop a general RCDA framework and use $\epsilon$-approximations and VC-dimension to connect uniform sampling to stable coresets, which leads to new coreset constructions and practical improvements in computation time and accuracy. Empirical results on diverse datasets corroborate the theory, showing fast coreset construction and robust, high-quality approximations, even under fairness constraints and in high dimensions.

Abstract

Uniform sampling is a highly efficient method for data summarization. However, its effectiveness in producing coresets for clustering problems is not yet well understood, primarily because it generally does not yield a strong coreset, which is the prevailing notion in the literature. We formulate stable coresets, a notion that is intermediate between the standard notions of weak and strong coresets, and effectively combines the broad applicability of strong coresets with highly efficient constructions, through uniform sampling, of weak coresets. Our main result is that a uniform sample of size $O(\epsilon^{-2}\log d)$ yields, with high constant probability, a stable coreset for $1$-median in $\mathbb{R}^d$ under the $\ell_1$ metric. We then leverage the powerful properties of stable coresets to easily derive new coreset constructions, all through uniform sampling, for $\ell_1$ and related metrics, such as Kendall-tau and Jaccard. We also show applications to fair rank aggregation and to approximation algorithms for the $k$-median problem in these metric spaces. Our experiments validate the benefits of stable coresets in practice, in terms of both construction time and approximation quality.

Paper Structure

This paper contains 28 sections, 25 theorems, 36 equations, 2 figures, 1 table.

Key Result

Theorem 1.4

Let $P\subset\mathbb{R}^d$ be finite and let $\epsilon \in (0,\frac{1}{5})$. Then, a uniform sample of size $O(\epsilon^{-2} \log d)$ from $P$ is a stable $(\epsilon/6, 4\epsilon)$-coreset for $1$-median in $\ell_1^d$ with probability at least $4/5$.
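The sampling step in Theorem 1.4 can be illustrated with a minimal sketch. It draws a uniform sample, takes the sample's coordinate-wise median (which minimizes the sum of $\ell_1$ distances), and compares that center's cost on the full set $P$ against the optimal cost. The synthetic data, the helper names, and the choice of constant 1 inside $O(\epsilon^{-2}\log d)$ are assumptions made for illustration; the theorem does not specify the constant, and this check only exercises the sample's own optimum, not the full stable-coreset guarantee.

```python
import numpy as np

def l1_median(points):
    # The coordinate-wise median minimizes the sum of l1 distances.
    return np.median(points, axis=0)

def l1_cost(points, center):
    # Average l1 distance from the points to a candidate center.
    return np.abs(points - center).sum(axis=1).mean()

def uniform_sample(points, m, rng):
    # Draw m points uniformly at random (with replacement).
    idx = rng.integers(0, len(points), size=m)
    return points[idx]

rng = np.random.default_rng(0)
n, d, eps = 20000, 16, 0.1

# Synthetic input (an assumption, not data from the paper).
P = rng.laplace(size=(n, d)) + rng.normal(size=d)

# Sample size eps^-2 * log d, with the hidden constant assumed to be 1.
m = int(np.ceil(np.log(d) / eps**2))
S = uniform_sample(P, m, rng)

c_opt = l1_median(P)   # optimal 1-median of the full set
c_hat = l1_median(S)   # 1-median computed on the sample only

# Ratio of the sample-based center's cost to the optimal cost, both on P.
ratio = l1_cost(P, c_hat) / l1_cost(P, c_opt)
print(f"sample size m = {m}, cost ratio on P = {ratio:.4f}")
```

A ratio close to 1 indicates that the center computed from the sample is nearly optimal for $P$. Note that the sample size depends on $d$ and $\epsilon$ but not on $n$.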

Figures (2)

  • Figure 1: Tradeoff between coreset size and relative error, comparing importance sampling-based coresets with uniform sampling-based coresets across three datasets. Shaded regions represent one standard deviation.

Theorems & Definitions (40)

  • Definition 1.1: Weak Coreset
  • Definition 1.2: Strong Coreset
  • Definition 1.3: Stable Coreset
  • Theorem 1.4
  • Proposition 2.0
  • Proposition 2.0
  • Theorem 3.1
  • Theorem 4.1
  • Definition 4.1: VC dimension (Vapnik and Chervonenkis, 1971)
  • Proposition 4.1
  • ...and 30 more