Coreset for Robust Geometric Median: Eliminating Size Dependency on Outliers

Ziyi Fang, Lingxiao Huang, Runkai Yang

TL;DR

This work delivers a new understanding of coresets for robust geometric median problems by removing the outlier-count dependence in coreset size when $n\ge 4m$, achieving a tight 1D bound of $\tilde{O}(\varepsilon^{-1/2}+\frac{m}{n}\varepsilon^{-1})$ and a general $d$-dimensional bound of $\tilde{O}(\varepsilon^{-2}\min\{\varepsilon^{-2},d\})$. The authors introduce a novel non-component-wise error analysis that reduces outlier influence, complemented by ball-range-space techniques to handle high dimensions and extensions to robust $(k,z)$-clustering across various metric spaces. They also provide comprehensive empirical evidence showing improved size-accuracy tradeoffs and faster runtimes relative to baselines, even when data assumptions are challenged. The results have practical impact for large-scale robust clustering tasks, offering scalable coresets that preserve robust cost structure across centers and cluster configurations. Overall, the paper advances both theory and practice in robust coreset design, with clear implications for streaming extensions and broader robust learning problems.
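To get a feel for the magnitudes involved, the following minimal sketch evaluates the two stated bounds numerically. It deliberately ignores the polylogarithmic factors and constants hidden by the $\tilde{O}$ notation, so the outputs are orders of magnitude rather than actual coreset sizes. The function names `bound_1d` and `bound_general` are illustrative and do not come from the paper.

```python
def bound_1d(eps: float, m: int, n: int) -> float:
    """eps^{-1/2} + (m/n) * eps^{-1}, with polylog factors and constants dropped."""
    return eps ** -0.5 + (m / n) * eps ** -1


def bound_general(eps: float, d: int) -> float:
    """eps^{-2} * min(eps^{-2}, d), with polylog factors and constants dropped."""
    return eps ** -2 * min(eps ** -2, d)


if __name__ == "__main__":
    # Both settings satisfy the assumption n >= 4m from the summary above.
    print(bound_1d(eps=0.01, m=10_000, n=1_000_000))
    print(bound_general(eps=0.1, d=50))
    print(bound_general(eps=0.1, d=1000))
```

Neither expression grows with $m$ on its own: in the 1D bound $m$ enters only through the ratio $m/n$, and the general bound does not involve $m$ at all.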

Abstract

We study the robust geometric median problem in Euclidean space $\mathbb{R}^d$, with a focus on coreset construction. A coreset is a compact summary of a dataset $P$ of size $n$ that approximates the robust cost for all centers $c$ within a multiplicative error $\varepsilon$. Given an outlier count $m$, we construct a coreset of size $\tilde{O}(\varepsilon^{-2} \cdot \min\{\varepsilon^{-2}, d\})$ when $n \geq 4m$, eliminating the $O(m)$ dependency present in prior work [Huang et al., 2022 & 2023]. For the special case of $d = 1$, we achieve an optimal coreset size of $\tilde{\Theta}(\varepsilon^{-1/2} + \frac{m}{n} \varepsilon^{-1})$, revealing a clear separation from the vanilla case studied in [Huang et al., 2023; Afshani and Chris, 2024]. Our results further extend to robust $(k,z)$-clustering in various metric spaces, eliminating the $m$-dependence under mild data assumptions. The key technical contribution is a novel non-component-wise error analysis, enabling substantial reduction of outlier influence, unlike prior methods that retain them. Empirically, our algorithms consistently outperform existing baselines in terms of size-accuracy tradeoffs and runtime, even when data assumptions are violated across a wide range of datasets.
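The coreset guarantee described above can be made concrete with a short sketch. It assumes the robust cost of a center $c$ is the total distance from $c$ to the data after discarding the $m$ farthest points, and that for a weighted summary the discarded mass is a total weight of $m$, split fractionally at the boundary. The paper's exact weighted definition (Definition 2.1) is not reproduced in this summary, so that convention is an assumption. The helper names `robust_cost` and `empirical_error` are hypothetical, and checking a finite set of candidate centers only estimates the error, since the guarantee quantifies over all centers.

```python
import numpy as np


def robust_cost(points, weights, center, m):
    """Weighted sum of distances to `center` after removing weight m
    from the farthest points (assumed convention, see lead-in)."""
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    dists = np.linalg.norm(points - np.asarray(center, dtype=float), axis=1)
    order = np.argsort(-dists)                # farthest points first
    w = weights[order]
    removed_before = np.cumsum(w) - w         # weight already removed before each point
    to_remove = np.clip(m - removed_before, 0.0, None)
    remaining = np.clip(w - to_remove, 0.0, None)
    return float(np.dot(remaining, dists[order]))


def empirical_error(P, S, w_S, m, centers):
    """Largest relative deviation between the robust cost on the weighted
    summary (S, w_S) and on the full dataset P, over the given centers."""
    w_P = np.ones(len(P))
    worst = 0.0
    for c in centers:
        full = robust_cost(P, w_P, c, m)
        if full == 0.0:
            continue                          # relative error undefined for zero cost
        approx = robust_cost(S, w_S, c, m)
        worst = max(worst, abs(approx - full) / full)
    return worst


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, m = 2000, 2, 100                    # satisfies n >= 4m
    P = rng.normal(size=(n, d))
    P[:m] += 50.0                             # plant m far-away outliers
    # Uniform sampling baseline, NOT the paper's construction.
    idx = rng.choice(n, size=200, replace=False)
    S, w_S = P[idx], np.full(200, n / 200)
    centers = rng.normal(size=(20, d))
    print(empirical_error(P, S, w_S, m, centers))
```

A weighted set $S$ would count as an $\varepsilon$-coreset under this convention if the returned value stayed at most $\varepsilon$ for every possible center, not just the sampled ones.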

Paper Structure

This paper contains 85 sections, 30 theorems, 94 equations, 11 figures, 8 tables, 3 algorithms.

Key Result

Theorem 1.1

Let $0 < \varepsilon < 0.5$ and $n > m \geq 1$. There exists a dataset $P \subset \mathbb{R}$ of size $n$ such that any $\varepsilon$-coreset of $P$ for the robust geometric median problem must have size $\Omega(\frac{m}{n-m})$.

Figures (11)

  • Figure 1: Illustration of the block partition. The blue square marks the optimal solution $c^\star$, and the blue triangle $c_L$ denotes the left boundary of the inlier set $P_I^\star$, with distance $r_{\max} = \mathrm{dist}(c_L, c^\star)$. The first subfigure partitions the one-dimensional space left of $P_M$ into disjoint blocks based on each point’s position relative to $c_L$: points farther than $r_{\max}$ form $B_{\mathrm{far}}$, and those within $2\varepsilon r_{\max}$ form $B_0$. The second subfigure shows the logarithmic subdivision of inner blocks $B_i^{(L)}$ within distance $r_{\max}$ from $c_L$.
  • Figure 2: A case for demonstrating the coreset lower bound for robust 1D geometric median. $T_i$ contains $\lfloor\frac{m}{q}\rfloor$ points where each point $p \in T_i$ satisfies $p=m^{i\alpha}$. $T_0$ contains the remaining points where each point $p \in T_0$ satisfies $p=0$.
  • Figure 3: Tradeoff between coreset size $|S|$ and empirical error $\widehat{\varepsilon}(S)$.
  • Figure 4: Tradeoff between coreset size $|S|$ and empirical error $\widehat{\varepsilon}(S)$ for robust geometric median when we set $m=n/2$. In this scenario, the assumption $n \geq 4m$ is violated.
  • Figure 5: Tradeoff between coreset size $|S|$ and empirical error $\widehat{\varepsilon}(S)$ for robust geometric median when we perturb $10\%$ of the points.
  • ...and 6 more figures

Theorems & Definitions (61)

  • Theorem 1.1: Coreset lower bound for robust geometric median
  • Theorem 1.2: Optimal coreset for robust 1D geometric median
  • Theorem 1.3: Coreset for robust geometric median in $\mathbb{R}^d$
  • Theorem 1.5: Coreset for robust $(k,z)$-clustering
  • Definition 2.1: Cost function for weighted dataset
  • Definition 2.2: Coreset for robust geometric median
  • Definition 3.1: Bucket and associated statistics
  • Lemma 3.1: Error analysis for buckets [har2005smaller]
  • Theorem 3.2: Coreset for vanilla 1D geometric median [huang2023small]
  • Lemma 3.3: Location of $P_I^{(c)}$
  • ...and 51 more