Coresets for Robust Clustering via Black-box Reductions to Vanilla Case

Shaofeng H.-C. Jiang; Jianing Lou

TL;DR

This work studies robust $(k,z)$-Clustering with $m$ outliers by designing $ε$-coresets via black-box reductions to vanilla clustering. It develops two reductions: Reduction I uses a density-based almost-dense decomposition to obtain an additive term $A_1 = O_z(kmε^{-1})$ and a coreset size bound tied to the vanilla coreset size $N(d,k,ε^{-1})$, while Reduction II turns a vanilla coreset into a size-preserving one, yielding an additive term $A_2 = O_z(mε^{-2z}\log^z(kmε^{-1}))$. Together, these give coresets of size near-linear in $k$ (up to polylogarithmic factors in $kmε^{-1}$) across several metric spaces, including Euclidean, doubling, finite metrics, and certain graph metrics via a separated-duplication framework. The approach also enables dynamic streaming implementations, delivering the first streaming algorithms for $k$-Median and $k$-Means with $m$ outliers on grids, with space $\tilde{O}(k+m)\cdot\mathrm{poly}(dε^{-1}\log Δ)$. By splitting the analysis according to whether the dataset decomposes into dense parts or the vanilla coreset must be made size-preserving, the paper uncovers the fundamental tradeoffs in the price of robustness for coresets and offers a general, deterministic reduction framework that leverages vanilla coreset results in broad settings.
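To get a feel for when each additive term is the smaller one, the following minimal sketch evaluates $kmε^{-1}$ and $mε^{-2z}\log^z(kmε^{-1})$ numerically. It is an illustration only: the function name `additive_terms` is hypothetical, the constants hidden by $O_z$ are dropped, and the natural logarithm is an assumption, so the outputs indicate orders of magnitude rather than actual coreset sizes.

```python
import math


def additive_terms(k, m, eps, z):
    """Return (a1, a2), the two additive terms with hidden O_z constants ignored.

    a1 ~ k * m / eps                          (Reduction I)
    a2 ~ m * eps^(-2z) * log^z(k * m / eps)   (Reduction II)
    The base of the logarithm is an assumption (natural log).
    """
    log_term = math.log(k * m / eps)
    a1 = k * m / eps
    a2 = m * eps ** (-2 * z) * log_term ** z
    return a1, a2


# Small k: the first term is smaller (about 1e4 versus about 9.2e4).
small_k = additive_terms(k=10, m=100, eps=0.1, z=1)

# Large k: the second term is smaller (about 1e6 versus about 1.4e5).
large_k = additive_terms(k=1000, m=100, eps=0.1, z=1)
```

Under these assumptions, the first term is preferable when $k$ is small relative to $ε^{-2z+1}$ times the logarithmic factor, and the second takes over as $k$ grows, which is why the combined bound takes the minimum of the two.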

Abstract

We devise $ε$-coresets for robust $(k,z)$-Clustering with $m$ outliers through black-box reductions to the vanilla case. Given an $ε$-coreset construction for vanilla clustering with size $N$, we construct coresets of size $N\cdot \mathrm{poly}\log(kmε^{-1}) + O_z\left(\min\{kmε^{-1}, mε^{-2z}\log^z(kmε^{-1}) \}\right)$ for various metric spaces, where $O_z$ hides $2^{O(z\log z)}$ factors. This increases the size of the vanilla coreset by a small multiplicative factor of $\mathrm{poly}\log(kmε^{-1})$, and the additive term is within a $(ε^{-1}\log (km))^{O(z)}$ factor of the size of the optimal robust coreset. Plugging in the vanilla coreset results of [Cohen-Addad et al., STOC'21], we obtain the first coresets for $(k,z)$-Clustering with $m$ outliers with size near-linear in $k$, while previous results have size at least $Ω(k^2)$ [Huang et al., ICLR'23; Huang et al., SODA'25]. Technically, we establish two conditions under which a vanilla coreset is also a robust coreset. The first condition requires the dataset to have a special structure: it can be broken into "dense" parts with bounded diameter. We combine this with a new bounded-diameter decomposition that has only $O_z(km ε^{-1})$ non-dense points to obtain the $O_z(km ε^{-1})$ additive bound. The second condition requires the vanilla coreset to possess an extra size-preserving property. We further give a black-box reduction that turns a vanilla coreset into one satisfying this size-preserving property, leading to the alternative $O_z(mε^{-2z}\log^{z}(kmε^{-1}))$ additive bound. We also implement our reductions in the dynamic streaming setting and obtain the first streaming algorithms for $k$-Median and $k$-Means with $m$ outliers, using space $\tilde{O}(k+m)\cdot\mathrm{poly}(dε^{-1}\log Δ)$ for inputs on the grid $[Δ]^d$.
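The abstract does not spell out the objective, so the sketch below assumes the standard formulation of robust $(k,z)$-Clustering: for a center set $C$, the cost is the sum of $\mathrm{dist}(x, C)^z$ over all points except the $m$ points farthest from $C$. The function name `robust_cost` and the Euclidean, unweighted setting are assumptions for illustration, not the paper's code.

```python
import math


def robust_cost(points, centers, m, z=2):
    """Robust (k, z)-clustering cost of unweighted points in Euclidean space.

    Assumed definition: each point pays dist(point, nearest center)^z,
    and the m most expensive points are discarded as outliers.
    """
    # Cost of each point with respect to its nearest center, in ascending order.
    per_point = sorted(
        min(math.dist(p, c) for c in centers) ** z for p in points
    )
    # Keep everything except the m largest contributions.
    kept = per_point[: max(len(per_point) - m, 0)]
    return sum(kept)


# One far-away point dominates the vanilla cost but is ignored when m = 1.
data = [(0.0, 0.0), (1.0, 0.0), (100.0, 0.0)]
center = [(0.0, 0.0)]
vanilla = robust_cost(data, center, m=0, z=2)  # 0 + 1 + 10000 = 10001.0
robust = robust_cost(data, center, m=1, z=2)   # 0 + 1 = 1.0
```

Setting `m=0` recovers the vanilla $(k,z)$-Clustering cost, with `z=1` giving $k$-Median and `z=2` giving $k$-Means. An $ε$-coreset is a small weighted set whose cost is within a $(1 \pm ε)$ factor of this quantity for every center set $C$; handling weights and partially removed outliers takes more care than this unweighted sketch shows.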
