Geometric Median Matching for Robust k-Subset Selection from Noisy Data

Anish Acharya; Sujay Sanghavi; Alexandros G. Dimakis; Inderjit S Dhillon

TL;DR

This work tackles robust data pruning for training large models, addressing the fragility of mean-based subset selection under arbitrary data corruption. The authors introduce Geometric Median (GM) Matching, which replaces the empirical mean with the geometric median in a kernel-based moment-matching framework and selects the subset with a greedy, kernel-inspired procedure. They provide theoretical guarantees that GM Matching converges to a neighborhood of the uncorrupted mean at rate $O(1/k)$ and bound the maximum mean discrepancy (MMD) between the selected subset and the clean distribution, even under Gross Corruption with a corruption fraction of up to $\frac{1}{2}$. Empirically, GM Matching consistently outperforms baselines on image classification and unconditional image generation tasks, especially at high corruption rates and aggressive pruning rates. The method scales through sub-sampling for GM estimation and batched processing, which makes it practical for large-scale, noisy data regimes.
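To illustrate why swapping the empirical mean for the geometric median matters, the following minimal sketch compares the two on a toy dataset with gross outliers. It uses Weiszfeld's iteration, a standard way to approximate the geometric median; the function names, the toy data, and the iteration count are assumptions made for illustration and are not taken from the paper.

```python
import math

def empirical_mean(points):
    """Coordinate-wise average of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def geometric_median(points, iters=200, eps=1e-9):
    """Approximate the geometric median with Weiszfeld's iteration.

    Illustrative sketch only: fixed iteration count, no convergence check.
    """
    dim = len(points[0])
    z = empirical_mean(points)  # start from the (non-robust) mean
    for _ in range(iters):
        # Inverse-distance weights; eps guards against division by zero.
        weights = [1.0 / max(math.dist(p, z), eps) for p in points]
        total = sum(weights)
        z = [sum(w * p[d] for w, p in zip(weights, points)) / total
             for d in range(dim)]
    return z

# Hypothetical toy data: six clean points near the origin, three gross outliers.
clean = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.4, 0.6]]
outliers = [[1000.0, 1000.0] for _ in range(3)]
data = clean + outliers

mu = empirical_mean(data)
gm = geometric_median(data)
print("mean:", mu)              # dragged far toward the outliers
print("geometric median:", gm)  # stays inside the clean cluster
```

With one third of the points corrupted, the mean lands hundreds of units away from the clean cluster, while the geometric median remains within it.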

Abstract

Data pruning, the combinatorial task of selecting a small and representative subset from a large dataset, is crucial for mitigating the enormous computational costs associated with training data-hungry modern deep learning models at scale. Since large-scale data collections are invariably noisy, developing data pruning strategies that remain robust even in the presence of corruption is critical in practice. However, existing data pruning methods often fail under high corruption rates due to their reliance on empirical mean estimation, which is highly sensitive to outliers. In response, we propose Geometric Median (GM) Matching, a novel k-subset selection strategy that leverages the Geometric Median, a robust estimator with an optimal breakdown point of 1/2, to enhance resilience against noisy data. Our method iteratively selects a k-subset such that the mean of the subset approximates the GM of the (potentially) noisy dataset, ensuring robustness even under arbitrary corruption. We provide theoretical guarantees, showing that GM Matching enjoys an improved O(1/k) convergence rate, a quadratic improvement over random sampling, even under arbitrary corruption. Extensive experiments across image classification and image generation tasks demonstrate that GM Matching consistently outperforms existing pruning approaches, particularly in high-corruption settings and at high pruning rates, making it a strong baseline for robust data pruning.
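The iterative selection described in the abstract can be sketched as a greedy loop that, at each step, adds the point that brings the running subset mean closest to a robust target. This is a simplified greedy variant written for illustration and is not necessarily the authors' exact update rule; `greedy_match`, the toy data, and the hand-set `target` (a stand-in for a geometric median estimate) are all assumptions.

```python
import math

def greedy_match(points, target, k):
    """Greedily pick k distinct indices whose mean tracks `target`.

    Simplified illustrative variant; works in raw input space, without kernels.
    """
    dim = len(target)
    chosen = []
    running_sum = [0.0] * dim
    remaining = set(range(len(points)))
    for t in range(1, k + 1):
        def gap(i):
            # Distance between the target and the subset mean if point i is added.
            cand = [(running_sum[d] + points[i][d]) / t for d in range(dim)]
            return math.dist(cand, target)
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = min(sorted(remaining), key=gap)
        chosen.append(best)
        remaining.remove(best)
        running_sum = [running_sum[d] + points[best][d] for d in range(dim)]
    return chosen

# Hypothetical toy data: indices 0-5 are clean, indices 6-8 are gross outliers.
data = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.4, 0.6],
        [1000.0, 1000.0], [1000.0, 1000.0], [1000.0, 1000.0]]
target = [0.5, 0.5]  # stand-in for a geometric median estimate of `data`

chosen = greedy_match(data, target, k=3)
subset_mean = [sum(data[i][d] for i in chosen) / len(chosen) for d in range(2)]
print("selected indices:", chosen)
print("subset mean:", subset_mean)
```

Because the target is robust, the greedy loop never has a reason to pick an outlier here; matching the corrupted empirical mean instead would pull the selection toward the outliers.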
