DRoP: Distributionally Robust Data Pruning

Artem Vysogorets, Kartik Ahuja, Julia Kempe

TL;DR

This paper tackles the problem that data pruning, while improving efficiency, can exacerbate classification bias across classes in deep learning. It introduces DRoP, a distributionally robust pruning scheme that allocates pruning quotas $d_k$ proportional to $(1 - r_k)$ for each class, with $d_k = d(1 - r_k)/Z$, enabling random within-class pruning guided by hold-out validation errors. The authors provide a theoretical Gaussian-mixture analysis showing how optimal class priors align average and worst-class risks, which motivates error-based quotas as a way to approach worst-case robustness. Empirically, DRoP combined with random pruning (Random+DRoP) yields superior distributional robustness across diverse benchmarks (CIFAR, TinyImageNet, ImageNet, Waterbirds) and remains effective under imbalance and group-robust settings, often outperforming full-dataset or baseline pruning approaches. Overall, DRoP improves worst-class performance at a tolerable cost in average performance, offering practical data efficiency while mitigating classification bias in pruning workflows. Classes whose quota would exceed their available samples are saturated, and the excess density is distributed to unsaturated classes, which maintains the target dataset size $dN$ while prioritizing harder classes.
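As a rough illustration of the quota rule above, the following Python sketch computes class densities proportional to $(1 - r_k)$ and then prunes randomly within each class. It is a minimal sketch under stated assumptions, not the authors' implementation: $r_k$ is taken to be a per-class hold-out score in $[0, 1]$ (higher meaning the class is already handled well), $Z$ is chosen so that the kept samples total $dN$, and the function names `drop_class_densities` and `random_drop_prune`, as well as the iterative handling of saturated classes, are hypothetical.

```python
import random
from collections import defaultdict


def drop_class_densities(recalls, class_counts, d):
    """Densities d_k proportional to (1 - r_k), capped at 1 (assumed scheme).

    recalls[k]      : hold-out score r_k of class k, in [0, 1] (assumption)
    class_counts[k] : number of samples N_k in class k
    d               : target dataset density, so d * N samples are kept
    """
    budget = d * sum(class_counts)        # samples still to be allocated
    weights = [1.0 - r for r in recalls]  # harder classes get larger weight
    dens = [0.0] * len(recalls)
    free = set(range(len(recalls)))       # classes that are not yet saturated
    while free:
        # Normalizer over unsaturated classes; plays the role of Z.
        z = sum(weights[k] * class_counts[k] for k in free)
        free_total = sum(class_counts[k] for k in free)
        for k in free:
            # Fall back to a uniform density if every free class has zero weight.
            dens[k] = budget * weights[k] / z if z > 0 else budget / free_total
        over = {k for k in free if dens[k] > 1.0}
        if not over:
            break
        for k in over:                    # saturated: keep the whole class
            dens[k] = 1.0
            budget -= class_counts[k]
        free -= over                      # redistribute the rest next round
    return dens


def random_drop_prune(labels, recalls, d, seed=0):
    """Return sorted indices kept after random within-class pruning.

    recalls must be ordered like the sorted distinct labels.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    classes = sorted(by_class)
    counts = [len(by_class[c]) for c in classes]
    dens = drop_class_densities(recalls, counts, d)
    kept = []
    for c, dk in zip(classes, dens):
        idx = by_class[c]
        kept.extend(rng.sample(idx, round(dk * len(idx))))
    return sorted(kept)


if __name__ == "__main__":
    labels = [0] * 100 + [1] * 100
    print(drop_class_densities([0.9, 0.1], [100, 100], 0.8))
    print(len(random_drop_prune(labels, [0.9, 0.5], 0.5)))
```

In the first call, the raw quota of the harder class exceeds 1, so that class is kept in full and the remaining budget goes to the easier class (a density of about 0.6), which keeps the total at $dN = 160$ samples.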

Abstract

In the era of exceptionally data-hungry models, careful selection of the training data is essential to mitigate the extensive costs of deep learning. Data pruning offers a solution by removing redundant or uninformative samples from the dataset, which yields faster convergence and improved neural scaling laws. However, little is known about its impact on classification bias of the trained models. We conduct the first systematic study of this effect and reveal that existing data pruning algorithms can produce highly biased classifiers. We present theoretical analysis of the classification risk in a mixture of Gaussians to argue that choosing appropriate class pruning ratios, coupled with random pruning within classes has potential to improve worst-class performance. We thus propose DRoP, a distributionally robust approach to pruning and empirically demonstrate its performance on standard computer vision benchmarks. In sharp contrast to existing algorithms, our proposed method continues improving distributional robustness at a tolerable drop of average performance as we prune more from the datasets.

Paper Structure (31 sections, 2 theorems, 22 equations, 18 figures, 2 tables, 1 algorithm)

This paper contains 31 sections, 2 theorems, 22 equations, 18 figures, 2 tables, 1 algorithm.

Key Result

Theorem B.1

If Equation assumption-1 holds, define $t^{*}(\phi_0/\phi_1)$ as in Equation Eq:Solution. Then, $t^{*}(\phi_0/\phi_1)$ is the statistical risk minimizer for the Gaussian mixture model if

Figures (18)

  • Figure 1: Pruning Exacerbates Bias: Dynamic Uncertainty applied to CIFAR-100. See the appendix (App:Additional) for similar plots for other pruning methods and models. Left: Sorted class densities at different dataset density levels. We also report the minimum number of samples per class (SPC) at $10\%$ dataset density. Right: Full dataset test class-wise accuracy against dataset density. We also report the correlation coefficient between these two quantities across classes, averaged over $5$ dataset densities.
  • Figure 2: The average test performance of various data pruning algorithms against dataset density (fraction of samples remaining after pruning) and worst-class accuracy. All results averaged over $3$ random seeds. Error bands represent min/max. Additional plots can be found in the appendix (App:Full-Scatters).
  • Figure 3: The effect of different pruning procedures on the solution of the mixture-of-Gaussians problem with $\mu_0=-1$, $\mu_1=1$, $\sigma_0=0.5$, $\sigma_1=1$, and $\phi_0=\phi_1$. Pruning to dataset density $d=50\%$. Left: Random pruning with the optimal class-wise densities that satisfy $d_1\phi_1\sigma_0=d_0\phi_0\sigma_1$. Middle: SSP. Right: Random pruning with respect to class ratios provided by the SSP algorithm. All results averaged across $10$ datasets $\{D_i\}_{i=1}^{10}$, each with $400$ points. The average ERM is $\overline{T}=\frac{1}{10}\sum_{i=1}^{10}T(D'_i)$, fitted to the pruned datasets $D'_i$. The class risks of the average and worst-class optimal decisions for this Gaussian mixture are $R_0[t^{*}(1)]=4.8\%$, $R_1[t^{*}(1)]=12.1\%$, and $R_0(\hat{t})=R_1(\hat{t})=9.1\%$.
  • Figure 4: (a): Class-wise risk ratios of the optimal solution $t^{*}=t^{*}(\phi_0/\phi_1)$ vs. optimal ratios based on Equation (Eq:optimal), computed for various $\sigma_0<\sigma_1$ drawn uniformly from $[10^{-2}, 10^2]$, with $\phi_0\sim U[0,1]$ and $\phi_1=1-\phi_0$. The results are independent of $\mu_0, \mu_1$. (b): Random pruning with DRoP. Left: $d=75\%$; Right: $d=50\%$.
  • Figure 5: The average test performance of various data pruning protocols against dataset density and worst-class accuracy. All results averaged over $3$ random seeds. Error bands represent min/max.
  • ...and 13 more figures
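
The Gaussian-mixture setup in the Figure 3 caption can be checked numerically. The Python sketch below assumes a threshold rule that predicts class 0 when $x \le t$ (with $\mu_0 < \mu_1$), takes the average-risk threshold to be the root of the density-matching condition closest to the midpoint of the means, and reads the dataset density as $d = d_0\phi_0 + d_1\phi_1$; these choices and all function names are illustrative assumptions rather than the paper's definitions. Under them, the computed class risks should agree with the caption's $4.8\%$, $12.1\%$, and $9.1\%$, and pruning with densities satisfying $d_1\phi_1\sigma_0 = d_0\phi_0\sigma_1$ moves the average-risk threshold onto the worst-class one.

```python
import math


def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def class_risks(t, mu0, mu1, s0, s1):
    """Class-wise risks of the rule 'predict class 0 iff x <= t' (assumed)."""
    r0 = 1.0 - norm_cdf((t - mu0) / s0)  # class 0 points falling above t
    r1 = norm_cdf((t - mu1) / s1)        # class 1 points falling below t
    return r0, r1


def average_optimal_threshold(mu0, mu1, s0, s1, phi0, phi1):
    """Solve phi0 * N(t; mu0, s0) = phi1 * N(t; mu1, s1) for t.

    Taking logs gives a*t^2 + b*t + c = 0; the root nearest the midpoint
    of the means is returned (an assumption about which root is meant).
    """
    a = 1.0 / (2 * s1**2) - 1.0 / (2 * s0**2)
    b = mu0 / s0**2 - mu1 / s1**2
    c = (mu1**2 / (2 * s1**2) - mu0**2 / (2 * s0**2)
         + math.log(phi0 * s1 / (phi1 * s0)))
    if abs(a) < 1e-12:                   # equal variances: linear equation
        return -c / b
    disc = math.sqrt(b * b - 4 * a * c)
    roots = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]
    mid = 0.5 * (mu0 + mu1)
    return min(roots, key=lambda t: abs(t - mid))


def worst_class_threshold(mu0, mu1, s0, s1):
    """Threshold equalizing both class risks: (t - mu0)/s0 = (mu1 - t)/s1."""
    return (mu0 * s1 + mu1 * s0) / (s0 + s1)


def optimal_densities(d, phi0, phi1, s0, s1):
    """Densities with d1*phi1*s0 = d0*phi0*s1 and d0*phi0 + d1*phi1 = d."""
    d0 = d * s0 / ((s0 + s1) * phi0)
    d1 = d * s1 / ((s0 + s1) * phi1)
    return d0, d1


if __name__ == "__main__":
    mu0, mu1, s0, s1, phi0, phi1 = -1.0, 1.0, 0.5, 1.0, 0.5, 0.5
    t_avg = average_optimal_threshold(mu0, mu1, s0, s1, phi0, phi1)
    t_hat = worst_class_threshold(mu0, mu1, s0, s1)
    print("average-optimal risks:", class_risks(t_avg, mu0, mu1, s0, s1))
    print("worst-class-optimal risks:", class_risks(t_hat, mu0, mu1, s0, s1))
    d0, d1 = optimal_densities(0.5, phi0, phi1, s0, s1)
    t_pruned = average_optimal_threshold(mu0, mu1, s0, s1, d0 * phi0, d1 * phi1)
    print("densities:", (d0, d1), "pruned threshold:", t_pruned, "t_hat:", t_hat)
```

With the caption's parameters, the densities come out as $d_0 = 1/3$ and $d_1 = 2/3$, so the harder, higher-variance class keeps twice as large a fraction of its samples.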

Theorems & Definitions (4)

  • Theorem B.1
  • proof
  • Lemma B.2
  • proof