Geometric Median Matching for Robust k-Subset Selection from Noisy Data

Anish Acharya, Sujay Sanghavi, Alexandros G. Dimakis, Inderjit S Dhillon

TL;DR

This work tackles robust data pruning for training large models by addressing the fragility of mean-based subset selection under arbitrary data corruption. The authors introduce Geometric Median (GM) Matching, which replaces the empirical mean with the geometric median in a kernel-based moment-matching framework and selects the subset with a greedy, kernel-inspired procedure. They provide theoretical guarantees that GM Matching converges to a neighborhood of the uncorrupted mean at rate $O(1/k)$ and bound the MMD between the selected subset and the clean distribution, even under Gross Corruption of up to a $\frac{1}{2}$ fraction of the samples. Empirically, GM Matching consistently outperforms baselines across image classification and unconditional image generation tasks, especially at high corruption and aggressive pruning, demonstrating robust, transferable data selection. The method scales via sub-sampling for GM estimation and batched processing, which makes it practical for large-scale, noisy data regimes.

Abstract

Data pruning, the combinatorial task of selecting a small and representative subset from a large dataset, is crucial for mitigating the enormous computational costs associated with training data-hungry modern deep learning models at scale. Since large-scale data collections are invariably noisy, developing data pruning strategies that remain robust even in the presence of corruption is critical in practice. However, existing data pruning methods often fail under high corruption rates due to their reliance on empirical mean estimation, which is highly sensitive to outliers. In response, we propose Geometric Median (GM) Matching, a novel k-subset selection strategy that leverages the Geometric Median, a robust estimator with an optimal breakdown point of 1/2, to enhance resilience against noisy data. Our method iteratively selects a k-subset such that the mean of the subset approximates the GM of the (potentially) noisy dataset, ensuring robustness even under arbitrary corruption. We provide theoretical guarantees, showing that GM Matching enjoys an improved O(1/k) convergence rate, a quadratic improvement over random sampling, even under arbitrary corruption. Extensive experiments across image classification and image generation tasks demonstrate that GM Matching consistently outperforms existing pruning approaches, particularly in high-corruption settings and at high pruning rates, making it a strong baseline for robust data pruning.
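The selection idea described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the paper: the feature map is the identity, the geometric median is approximated with Weiszfeld's iteration, and the greedy step is a herding-style update. The names `geometric_median` and `gm_matching` and the synthetic data are hypothetical. Because only the first moment is matched in raw feature space, the sketch shows the subset mean tracking the GM and does not reproduce the paper's kernelized algorithm or its guarantees.

```python
import numpy as np


def geometric_median(points, n_iter=200, tol=1e-8):
    """Approximate the geometric median with Weiszfeld's iteration (assumed solver)."""
    z = points.mean(axis=0)  # start from the (non-robust) empirical mean
    for _ in range(n_iter):
        dist = np.linalg.norm(points - z, axis=1)
        weights = 1.0 / np.maximum(dist, 1e-12)  # guard against division by zero
        z_new = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z


def gm_matching(features, k):
    """Greedily pick k distinct rows whose running mean tracks the GM (herding-style)."""
    target = geometric_median(features)
    theta = target.copy()  # residual direction that the next pick should cover
    available = np.ones(len(features), dtype=bool)
    selected = []
    for _ in range(k):
        scores = features @ theta
        scores[~available] = -np.inf  # select without replacement
        idx = int(np.argmax(scores))
        selected.append(idx)
        available[idx] = False
        theta += target - features[idx]  # herding update toward the GM target
    return np.array(selected)


# Synthetic example (illustrative only): 60% clean samples, 40% adversarial outliers.
rng = np.random.default_rng(0)
clean = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(600, 2))
outliers = rng.normal(loc=[40.0, -30.0], scale=1.0, size=(400, 2))
data = np.vstack([clean, outliers])

subset = gm_matching(data, k=100)
clean_mean = clean.mean(axis=0)
err_mean = np.linalg.norm(data.mean(axis=0) - clean_mean)
err_subset = np.linalg.norm(data[subset].mean(axis=0) - clean_mean)
print(f"full-data mean error: {err_mean:.2f}, GM-matched subset error: {err_subset:.2f}")
```

In this toy setting the mean of the selected subset stays much closer to the clean mean than the mean of the full corrupted dataset does, which is the behavior the abstract attributes to matching the GM instead of the empirical mean.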

Paper Structure

This paper contains 46 sections, 5 theorems, 94 equations, 16 figures, 11 tables, 2 algorithms.

Key Result

Theorem 1

Suppose that we are given a set of grossly corrupted samples $\mathcal{D} = \mathcal{D}_{\mathcal{G}} \cup \mathcal{D}_{\mathcal{B}}$ (Definition 1), an $\epsilon$-approximate $\textsc{Gm}(\cdot)$ oracle $\bm{\mu}^{\textsc{Gm}}_\epsilon(\cdot)$, and a bounded, characteristic feature map ... where $\sigma^2(\mathcal{D}_{\mathcal{G}}) = \frac{1}{|\mathcal{D}_{\mathcal{G}}|}\sum \cdots$ ...

Figures (16)

  • Figure 1: Robust Mean Estimation: As the corruption rate $0 \leq \psi < \frac{1}{2}$ increases (representing the fraction of samples drawn from an adversary-chosen distribution), the empirical mean increasingly deviates from the true uncorrupted mean. In contrast, the geometric median ($\textsc{Gm}$) remains robust and stays closer to the uncorrupted (oracle) mean, demonstrating its resilience to outliers.
  • Figure 2: Data Pruning in the Wild (Sampling from Noisy Gaussian): We select a 10% subset of examples drawn from an anisotropic Gaussian (blue) in which 40% of the samples are replaced by samples from an adversarial distribution (red). We compare $\textsc{Gm}\;\text{Matching}$ with several spatial sampling algorithms (the paper's experimental baselines): Random, Easy, Hard, Moderate, and Kernel Herding. $\textsc{Gm}\;\text{Matching}$ yields a significantly more robust subset than the other approaches.
  • Figure 3: Image Corruption: Distinct types of data corruption applied to MNIST images.
  • Figure 4: (Proxy Embedding Space) t-SNE Visualization of CLIP ViT-B/32 Embeddings of a subset of the MNIST images from Figure 3: (a) Clean (baseline), and with 45% of the samples corrupted by (b) Gaussian noise, (c) Uniform noise, (d) Random patches, (e) Cutout noise.
  • Figure 5: Geometric Median Visualization: The plots illustrate the computation of the geometric median (denoted by the pink circle) for three different spatial point configurations: (a) Triangle, (b) Pentagon, and (c) Random. The color gradient represents the sum of distances $\rho({\mathbf{z}})$ from a candidate point ${\mathbf{z}}$ to all data points, with darker regions indicating smaller values of $\rho({\mathbf{z}})$. The white dashed lines show the connections between the geometric median and the data points, emphasizing how the geometric median minimizes the total Euclidean distance to all points. Additionally, the Voronoi regions formed around the data points visually partition the space based on proximity, offering insight into how the geometric median balances contributions from each point. In symmetric configurations such as (a) and (b), the Voronoi structure highlights the symmetry in influence regions, leading to a geometric median located at the center. For the random configuration (c), the irregular Voronoi regions illustrate the varying influence of data points, with the geometric median robustly adapting to minimize the total distance while down-weighting the effect of outlier-like points.
  • ...and 11 more figures
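
The Figure 5 caption describes the geometric median as the minimizer of the sum of distances $\rho({\mathbf{z}})$. A small sketch can make that concrete for the symmetric triangle case: it evaluates $\rho$ on a grid and compares the grid minimizer with the centroid. The triangle coordinates, grid resolution, and variable names are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Vertices of an equilateral triangle centered at the origin (illustrative data).
angles = np.deg2rad([90.0, 210.0, 330.0])
points = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Candidate locations z on a regular grid covering the triangle.
axis = np.linspace(-1.5, 1.5, 301)
gx, gy = np.meshgrid(axis, axis)
grid = np.stack([gx.ravel(), gy.ravel()], axis=1)

# rho(z): sum of Euclidean distances from each candidate z to every data point.
rho = np.linalg.norm(grid[:, None, :] - points[None, :, :], axis=2).sum(axis=1)

best = grid[np.argmin(rho)]
centroid = points.mean(axis=0)
print("grid minimizer of rho:", best)
print("centroid of the points:", centroid)
```

For this symmetric configuration the grid minimizer coincides with the centroid, consistent with the caption's remark that the geometric median sits at the center in cases (a) and (b). A grid search is used here only for transparency; it does not scale beyond low dimensions.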

Theorems & Definitions (17)

  • Definition 1: Gross Corruption
  • Definition 2: Breakdown Point
  • Definition 3: Convex Hull
  • Definition 4: Geometric Median
  • Theorem 1
  • Lemma 1
  • Lemma 2: $\textsc{Gm}\;\text{Matching}$ Computational Complexity
  • Definition 5: Multivariate Gaussian
  • Definition 6: Isotropic Gaussian
  • Definition 7: Anisotropic Gaussian
  • ...and 7 more