Coreset for Robust Geometric Median: Eliminating Size Dependency on Outliers
Ziyi Fang, Lingxiao Huang, Runkai Yang
TL;DR
This work delivers a new understanding of coresets for robust geometric median problems by removing the outlier-count dependence in coreset size when $n\ge 4m$, achieving a tight 1D bound of $\tilde{O}(\varepsilon^{-1/2}+\frac{m}{n}\varepsilon^{-1})$ and a general $d$-dimensional bound of $\tilde{O}(\varepsilon^{-2}\min\{\varepsilon^{-2},d\})$. The authors introduce a novel non-component-wise error analysis that reduces outlier influence, complemented by ball-range-space techniques to handle high dimensions and extensions to robust $(k,z)$-clustering across various metric spaces. They also provide comprehensive empirical evidence showing improved size-accuracy tradeoffs and faster runtimes relative to baselines, even when data assumptions are challenged. The results have practical impact for large-scale robust clustering tasks, offering scalable coresets that preserve robust cost structure across centers and cluster configurations. Overall, the paper advances both theory and practice in robust coreset design, with clear implications for streaming extensions and broader robust learning problems.
Abstract
We study the robust geometric median problem in Euclidean space $\mathbb{R}^d$, with a focus on coreset construction. A coreset is a compact summary of a dataset $P$ of size $n$ that approximates the robust cost for all centers $c$ within a multiplicative error $\varepsilon$. Given an outlier count $m$, we construct a coreset of size $\tilde{O}(\varepsilon^{-2} \cdot \min\{\varepsilon^{-2}, d\})$ when $n \geq 4m$, eliminating the $O(m)$ dependency present in prior work [Huang et al., 2022 & 2023]. For the special case of $d = 1$, we achieve an optimal coreset size of $\tilde{\Theta}(\varepsilon^{-1/2} + \frac{m}{n} \varepsilon^{-1})$, revealing a clear separation from the vanilla case studied in [Huang et al., 2023; Afshani and Chris, 2024]. Our results further extend to robust $(k,z)$-clustering in various metric spaces, eliminating the $m$-dependence under mild data assumptions. The key technical contribution is a novel non-component-wise error analysis, which enables a substantial reduction of outlier influence, unlike prior methods, which retain the outliers. Empirically, across a wide range of datasets, our algorithms consistently outperform existing baselines in terms of size-accuracy tradeoffs and runtime, even when the data assumptions are violated.
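To make the coreset guarantee concrete, the following minimal Python sketch evaluates a robust cost and the relative error of a weighted summary at a single center. It assumes the standard definition of the robust geometric median objective (the sum of Euclidean distances to $c$ after discarding the $m$ farthest points, or total weight $m$ in the weighted case), which the abstract does not spell out. The names `robust_cost` and `relative_error` are hypothetical and are not taken from the paper; this is not the authors' construction.

```python
import math


def robust_cost(points, center, m, weights=None):
    """Weighted sum of distances to `center` after discarding the farthest
    points up to a total weight of m (assumed definition, see lead-in)."""
    if weights is None:
        weights = [1.0] * len(points)
    # Sort by distance, farthest first, so outlier weight is removed greedily.
    dists = sorted(
        ((math.dist(p, center), w) for p, w in zip(points, weights)),
        key=lambda t: t[0],
        reverse=True,
    )
    remaining = float(m)  # outlier weight still to be discarded
    total = 0.0
    for d, w in dists:
        drop = min(w, remaining)
        remaining -= drop
        total += (w - drop) * d
    return total


def relative_error(points, coreset, coreset_weights, center, m):
    """Relative error of the coreset's robust cost at one center.
    Assumes the exact robust cost at `center` is nonzero."""
    exact = robust_cost(points, center, m)
    approx = robust_cost(coreset, center, m, coreset_weights)
    return abs(approx - exact) / exact
```

An $\varepsilon$-coreset requires `relative_error` to be at most $\varepsilon$ for every center $c$ simultaneously, so evaluating it at a handful of centers is only a sanity check, not a proof of the guarantee.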
