Coresets for Robust Clustering via Black-box Reductions to Vanilla Case

Shaofeng H.-C. Jiang, Jianing Lou

TL;DR

This work studies robust $(k,z)$-Clustering with $m$ outliers by designing $ε$-coresets via black-box reductions to vanilla clustering. It develops two reductions. Reduction I uses a density-based almost-dense decomposition to obtain an additive term $A_1 = O_z(kmε^{-1})$ and a coreset size bound tied to the vanilla coreset size $N(d,k,ε^{-1})$. Reduction II produces a size-preserving vanilla coreset, yielding an additive term $A_2 = O_z(mε^{-2z}\log^z(kmε^{-1}))$. Together, these give coresets of size near-linear in $k$ (up to polylog factors in $kmε^{-1}$) across several metric spaces, including Euclidean, doubling, finite metrics, and certain graph metrics via a separated-duplication framework. The approach also enables dynamic streaming implementations, delivering the first streaming algorithms for $k$-Median and $k$-Means with $m$ outliers on grids, with space $\tilde{O}(k+m)\cdot\mathrm{poly}(dε^{-1}\logΔ)$. By separating the analysis into the case where the dataset is dense and the case where a size-preserving reduction is required, the paper uncovers the fundamental trade-offs in the price of robustness for coresets and offers a general, deterministic reduction framework that leverages vanilla coreset results in broad settings.
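To get a feel for when each additive term is the smaller one, the sketch below evaluates $A_1$ and $A_2$ numerically. It is only an illustration: the constants hidden by $O_z$ are set to 1, the logarithm is taken to be natural, and the function name `additive_terms` is hypothetical. None of these choices come from the paper.

```python
import math

def additive_terms(k, m, eps, z):
    """Return (A1, A2) with every hidden constant set to 1 (an assumption).

    A1 = k * m / eps
    A2 = m * eps^(-2z) * log^z(k * m / eps), natural log assumed
    """
    a1 = k * m / eps
    a2 = m * eps ** (-2 * z) * math.log(k * m / eps) ** z
    return a1, a2

# Compare the two terms as k grows, with m, eps and z held fixed.
for k in (10, 10_000, 10_000_000):
    a1, a2 = additive_terms(k=k, m=100, eps=0.1, z=2)
    smaller = "A1" if a1 <= a2 else "A2"
    print(f"k={k}: A1={a1:.3g}, A2={a2:.3g}, smaller term: {smaller}")
```

Under these assumptions $A_1$ is smaller when $k$ is small relative to $ε^{-2z}$, while $A_2$ takes over once $k$ is large, which is why the combined bound takes the minimum of the two.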

Abstract

We devise $ε$-coresets for robust $(k,z)$-Clustering with $m$ outliers through black-box reductions to the vanilla case. Given an $ε$-coreset construction for vanilla clustering with size $N$, we construct coresets of size $N\cdot \mathrm{poly}\log(kmε^{-1}) + O_z\left(\min\{kmε^{-1}, mε^{-2z}\log^z(kmε^{-1}) \}\right)$ for various metric spaces, where $O_z$ hides $2^{O(z\log z)}$ factors. This increases the size of the vanilla coreset by a small multiplicative factor of $\mathrm{poly}\log(kmε^{-1})$, and the additive term is within a $(ε^{-1}\log (km))^{O(z)}$ factor of the size of the optimal robust coreset. Plugging in the vanilla coreset results of [Cohen-Addad et al., STOC'21], we obtain the first coresets for $(k,z)$-Clustering with $m$ outliers with size near-linear in $k$, while previous results have size at least $Ω(k^2)$ [Huang et al., ICLR'23; Huang et al., SODA'25]. Technically, we establish two conditions under which a vanilla coreset is also a robust coreset. The first condition requires the dataset to have a special structure: it can be broken into "dense" parts with bounded diameter. We combine this with a new bounded-diameter decomposition that has only $O_z(km ε^{-1})$ non-dense points to obtain the $O_z(km ε^{-1})$ additive bound. The second condition requires the vanilla coreset to possess an extra size-preserving property. We further give a black-box reduction that turns a vanilla coreset into one satisfying this size-preserving property, leading to the alternative $O_z(mε^{-2z}\log^{z}(kmε^{-1}))$ additive bound. We also implement our reductions in the dynamic streaming setting and obtain the first streaming algorithms for $k$-Median and $k$-Means with $m$ outliers, using space $\tilde{O}(k+m)\cdot\mathrm{poly}(dε^{-1}\logΔ)$ for inputs on the grid $[Δ]^d$.
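As a concrete reading of the objective, the sketch below computes the cost of $(k,z)$-Clustering with $m$ outliers for a fixed center set, using the standard formulation in which the $m$ units of weight farthest from the centers are discarded. The helper names `robust_cost` and `within_eps`, and the fractional handling of weighted points, are assumptions made for illustration and are not taken from the paper.

```python
import math

def robust_cost(points, centers, z, m, weights=None):
    """Sum of dist(x, centers)^z after discarding the m farthest units of weight.

    With unit weights this drops the m points farthest from their nearest
    center. For weighted points (e.g. a coreset) weight is removed
    fractionally, farthest first; this convention is an assumption here.
    """
    if weights is None:
        weights = [1.0] * len(points)
    # Distance from each point to its nearest center.
    dists = [min(math.dist(p, c) for c in centers) for p in points]
    remaining = float(m)
    total = 0.0
    # Visit points from farthest to nearest, spending the outlier budget first.
    for d, w in sorted(zip(dists, weights), reverse=True):
        dropped = min(w, remaining)
        remaining -= dropped
        total += (w - dropped) * d ** z
    return total

def within_eps(coreset_cost, full_cost, eps):
    """Check the coreset guarantee |cost(S, C) - cost(X, C)| <= eps * cost(X, C)."""
    return abs(coreset_cost - full_cost) <= eps * full_cost

X = [(0, 0), (1, 0), (0, 1), (100, 100)]
C = [(0, 0)]
print(robust_cost(X, C, z=2, m=0))  # the far point dominates the cost
print(robust_cost(X, C, z=2, m=1))  # the far point is treated as the outlier
```

An $ε$-coreset for the robust problem must pass the `within_eps` check for every choice of $k$ centers, not just for one fixed `C` as in this toy call.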

Paper Structure

This paper contains 73 sections, 38 theorems, 94 equations, 1 figure, 2 tables, 5 algorithms.

Key Result

Theorem 1.1

Assume there is an algorithm that constructs an $\varepsilon$-coreset for $(k,z)$-Clustering of size $N(d,k,\varepsilon^{-1})$ for any dataset from $\mathbb{R}^d$. Then, there is an algorithm that constructs an $\varepsilon$-coreset for $(k,z,m)$-Clustering of size […] for any dataset from $\mathbb{R}^d$, where $A_1 = O_z\left(km\varepsilon^{-1}\right)$ […]

Figures (1)

  • Figure 1: Illustration of the construction of a separated duplication of a graph, with edge weights omitted. On the left, we show an original graph $G$, using a triangle graph as an example. On the right, we demonstrate our construction of the $w$-separated $3$-duplication of $G$, where the three triangle graphs connected by black edges represent the copies of $G$, and the red edges are weighted by $w$ to satisfy the separation requirement.

Theorems & Definitions (72)

  • Theorem 1.1: Euclidean case
  • Theorem 1.2: Informal version of the dynamic coresets theorem
  • Definition 2.1: Coresets
  • Definition 2.2: Additive-error coresets
  • Definition 2.3: $(\alpha,\beta,\gamma)$-Approximation
  • Lemma 2.4: Generalized triangle inequalities
  • Theorem 3.1
  • Definition 3.2: $\lambda$-Bounded partition
  • Lemma 3.3: Vanilla coresets on dense datasets are robust
  • proof
  • ...and 62 more