DRoP: Distributionally Robust Data Pruning
Artem Vysogorets, Kartik Ahuja, Julia Kempe
TL;DR
This paper addresses the problem that data pruning, while improving training efficiency, can exacerbate classification bias across classes in deep learning. It introduces DRoP, a distributionally robust pruning scheme that assigns each class $k$ a target density $d_k$ proportional to its hold-out validation error $1 - r_k$, namely $d_k = d(1 - r_k)/Z$, where $d$ is the overall target density and $Z$ is a normalizing factor; samples are then pruned at random within each class. The constraint $d_k \leq 1$ can be accommodated by distributing excess density to unsaturated classes, maintaining the target dataset size $dN$ while prioritizing harder classes. The authors provide a theoretical Gaussian-mixture analysis showing how optimal class priors align average and worst-class risks, which motivates error-based class densities as a way to approach worst-case robustness. Empirically, DRoP combined with random pruning (Random+DRoP) yields superior distributional robustness across diverse benchmarks (CIFAR, TinyImageNet, ImageNet, Waterbirds) and remains effective under class imbalance and in group-robust settings, often outperforming full-dataset training or baseline pruning approaches. Overall, DRoP improves worst-class performance at a tolerable cost in average performance, offering practical data efficiency while mitigating the classification bias that pruning can introduce.
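The density rule $d_k = d(1 - r_k)/Z$ can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `drop_class_densities` and `random_prune` are hypothetical, and the sketch assumes that $Z = \sum_k (1 - r_k) N_k / N$ (the choice that makes the kept samples sum to $dN$, with $N_k$ the size of class $k$), that $0 < d \leq 1$, and that excess density from saturated classes is redistributed to the remaining classes in proportion to $1 - r_k$.

```python
import numpy as np

def drop_class_densities(r, class_sizes, d):
    """Sketch: class densities d_k proportional to (1 - r_k), capped at 1.

    Assumes 0 < d <= 1 and that sum_k d_k * N_k should equal d * N.
    """
    r = np.asarray(r, dtype=float)
    n = np.asarray(class_sizes, dtype=float)
    budget = d * n.sum()              # total number of samples to keep, dN
    weights = 1.0 - r                 # error-based weights 1 - r_k
    dens = np.zeros_like(r)
    saturated = np.zeros(len(r), dtype=bool)
    for _ in range(len(r)):
        dens[saturated] = 1.0         # saturated classes are kept in full
        free = ~saturated
        remaining = budget - n[saturated].sum()
        denom = (weights[free] * n[free]).sum()
        if denom <= 0:
            # Assumption: if every unsaturated class has zero error,
            # spread the remaining budget uniformly over those classes.
            dens[free] = remaining / n[free].sum()
            break
        dens[free] = remaining * weights[free] / denom
        over = free & (dens > 1.0)
        if not over.any():
            break
        saturated |= over             # cap these classes and redistribute
    return np.clip(dens, 0.0, 1.0)

def random_prune(labels, densities, seed=0):
    """Keep a uniformly random fraction densities[k] of each class k."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = []
    for k, d_k in enumerate(densities):
        idx = np.flatnonzero(labels == k)
        n_keep = int(round(d_k * len(idx)))
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))

# Usage with made-up numbers: two classes of 100 samples each.
labels = np.repeat([0, 1], 100)
densities = drop_class_densities([0.95, 0.2], [100, 100], d=0.8)
kept_indices = random_prune(labels, densities)
```

In this made-up example, the proportional rule alone would ask for a density above 1 for the second (harder) class. The sketch caps it at 1 and assigns the first class a density of 0.6, so 160 of the 200 samples are kept, matching $dN$.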
Abstract
In the era of exceptionally data-hungry models, careful selection of the training data is essential to mitigate the extensive costs of deep learning. Data pruning offers a solution by removing redundant or uninformative samples from the dataset, which yields faster convergence and improved neural scaling laws. However, little is known about its impact on the classification bias of the trained models. We conduct the first systematic study of this effect and reveal that existing data pruning algorithms can produce highly biased classifiers. We present a theoretical analysis of the classification risk in a mixture of Gaussians to argue that choosing appropriate class pruning ratios, coupled with random pruning within classes, has the potential to improve worst-class performance. We thus propose DRoP, a distributionally robust approach to pruning, and empirically demonstrate its performance on standard computer vision benchmarks. In sharp contrast to existing algorithms, our proposed method continues to improve distributional robustness at a tolerable drop in average performance as we prune more from the datasets.
