Sensitivity Sampling for $k$-Means: Worst Case and Stability Optimal Coreset Bounds
Nikhil Bansal, Vincent Cohen-Addad, Milind Prabhu, David Saulpic, Chris Schwiegelshohn
TL;DR
This work analyzes coresets for center-based clustering, focusing on sensitivity sampling as a simple yet powerful approach. Using a two-tier analysis (first a coarse variance/net bound, then a refined chaining framework that exploits stability), the authors derive optimal worst-case coreset sizes $\tilde{O}(k/\varepsilon^2 \cdot \min(\sqrt{k},\varepsilon^{-2}))$ for Euclidean $k$-means and a stability-enabled bound of $\tilde{O}(k/\varepsilon^2)$ when the input is $\beta$-stable with $\beta=\Omega(1)$. They also establish a matching $\Omega(k/\varepsilon^2)$ lower bound for coresets of stable instances that consist only of input points, extend the results to $k$-median and doubling metrics, and show that sensitivity sampling adapts to well-clusterable data without needing to know the stability parameter. Collectively, the results justify sensitivity sampling as the right coreset method for clustering in both worst-case and stable regimes, with broad implications for non-Euclidean metrics and practical data reduction. The methods hinge on a Gaussian-process/chaining analysis that bounds estimator variance across scales while controlling the net sizes, enabling tight tradeoffs between sampling complexity and accuracy.
Abstract
Coresets are arguably the most popular compression paradigm for center-based clustering objectives such as $k$-means. Given a point set $P$, a coreset $\Omega$ is a small, weighted summary that preserves the cost of all candidate solutions $S$ up to a $(1\pm \varepsilon)$ factor. For $k$-means in $d$-dimensional Euclidean space the cost for solution $S$ is $\sum_{p\in P}\min_{s\in S}\|p-s\|^2$. A very popular method for coreset construction, both in theory and practice, is Sensitivity Sampling, where points are sampled in proportion to their importance. We show that Sensitivity Sampling yields optimal coresets of size $\tilde{O}(k/\varepsilon^2\min(\sqrt{k},\varepsilon^{-2}))$ for worst-case instances. Uniquely among all known coreset algorithms, for well-clusterable data sets with $\Omega(1)$ cost stability, Sensitivity Sampling gives coresets of size $\tilde{O}(k/\varepsilon^2)$, improving over the worst-case lower bound. Notably, Sensitivity Sampling does not have to know the cost stability in order to exploit it: It is appropriately sensitive to the clusterability of the data set while being oblivious to it. We also show that any coreset for stable instances consisting of only input points must have size $\Omega(k/\varepsilon^2)$. Our results for Sensitivity Sampling also extend to the $k$-median problem, and more general metric spaces.
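To make the sampling scheme described in the abstract concrete, the following is a minimal NumPy sketch of sensitivity sampling for $k$-means, not the authors' implementation. It rests on several assumptions that the abstract does not specify: the importance score is taken to be one commonly used sensitivity upper bound, $\mathrm{cost}(p,A)/\mathrm{cost}(C_p,A) + 1/|C_p|$, where $A$ is a rough solution and $C_p$ is the cluster of $p$ in $A$; the rough solution comes from $D^2$ (k-means++ style) seeding; and the function names `kmeans_cost`, `rough_solution`, and `sensitivity_sample` are hypothetical.

```python
import numpy as np


def kmeans_cost(points, centers, weights=None):
    """(Weighted) k-means cost: sum_p w_p * min_s ||p - s||^2."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    per_point = d2.min(axis=1)
    if weights is None:
        return float(per_point.sum())
    return float((weights * per_point).sum())


def rough_solution(points, k, rng):
    """D^2 seeding; an assumed stand-in for the approximate solution A."""
    n = len(points)
    centers = [points[rng.integers(n)]]
    d2 = ((points - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        total = d2.sum()
        # Fall back to uniform sampling if every point already has cost 0.
        probs = d2 / total if total > 0 else np.full(n, 1.0 / n)
        centers.append(points[rng.choice(n, p=probs)])
        d2 = np.minimum(d2, ((points - centers[-1]) ** 2).sum(axis=1))
    return np.array(centers)


def sensitivity_sample(points, k, m, seed=0):
    """Sample m weighted points with probability proportional to an
    assumed sensitivity upper bound; returns (sample, weights)."""
    rng = np.random.default_rng(seed)
    centers = rough_solution(points, k, rng)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)   # cluster C_p of each point under A
    cost = d2.min(axis=1)        # cost(p, A)
    cluster_cost = np.bincount(assign, weights=cost, minlength=k)
    cluster_size = np.bincount(assign, minlength=k)
    denom = cluster_cost[assign]
    # cost(p, A) / cost(C_p, A), defined as 0 for zero-cost clusters.
    rel = np.divide(cost, denom, out=np.zeros_like(cost), where=denom > 0)
    scores = rel + 1.0 / cluster_size[assign]
    probs = scores / scores.sum()
    idx = rng.choice(len(points), size=m, replace=True, p=probs)
    # Inverse-probability weights make the weighted cost an unbiased
    # estimate of the full cost for any fixed candidate solution S.
    weights = 1.0 / (m * probs[idx])
    return points[idx], weights
```

Calling `sensitivity_sample(P, k, m)` on an `(n, d)` float array `P` returns `m` sampled points and their weights; comparing `kmeans_cost(P, S)` with `kmeans_cost(sample, S, weights)` for a candidate solution `S` gives an empirical check of the $(1\pm\varepsilon)$ guarantee. The sample size `m` needed for a given $\varepsilon$ is what the bounds above characterize; the sketch leaves it as a free parameter.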
