Data Pruning via Separability, Integrity, and Model Uncertainty-Aware Importance Sampling
Steven Grosz, Rui Zhao, Rajeev Ranjan, Hongcheng Wang, Manoj Aggarwal, Gerard Medioni, Anil Jain
TL;DR
This work addresses the challenge of scalable, robust data pruning for image classification, particularly under high pruning ratios and class imbalance. It introduces SIM, a metric that jointly quantifies data separability, data integrity, and model uncertainty, and SIMS, an adaptive importance-sampling pruning framework that biases sample retention based on pruning ratio and class distribution. SIM combines per-sample, multi-model estimates into a single score via normalized separability, integrity, and uncertainty terms, enabling pruning that preserves informative, high-quality samples across classes; SIMS further uses a ratio-aware sampling distribution with a hybrid class-dependent/independent strategy. Empirical results on CIFAR-10/100, Tiny-ImageNet, and iNaturalist show that SIMS outperforms baselines, especially at high pruning ratios, and generalizes across architectures, while reducing pruning-metric computation time, highlighting practical impact for scalable model training and deployment.
Abstract
This paper improves upon existing data pruning methods for image classification by introducing a novel pruning metric and pruning procedure based on importance sampling. The proposed pruning metric explicitly accounts for data separability, data integrity, and model uncertainty, while the sampling procedure is adaptive to the pruning ratio and considers both intra-class and inter-class separation to further enhance the effectiveness of pruning. Furthermore, the sampling method can readily be applied to other pruning metrics to improve their performance. Overall, the proposed approach scales well to high pruning ratios and generalizes better across different classification models, as demonstrated by experiments on four benchmark datasets, including a fine-grained classification scenario.
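To make the general idea concrete, the following is a minimal sketch of score-based pruning via importance sampling. It is not the authors' SIM or SIMS formulation, which this summary does not specify. The function names, the equal-weight average of the three normalized terms, the Gaussian weighting centered at `1 - prune_ratio`, and the `sigma` value are all hypothetical choices made only for illustration. The sketch also implements only a per-class (class-dependent) strategy and does not reproduce the hybrid class-dependent/independent scheme.

```python
import numpy as np


def min_max(x):
    """Rescale a 1-D array to [0, 1]; a constant array maps to all zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span


def combined_score(separability, integrity, uncertainty):
    """Hypothetical combination: equal-weight mean of the normalized terms."""
    return (min_max(separability) + min_max(integrity) + min_max(uncertainty)) / 3.0


def prune_by_importance_sampling(scores, labels, prune_ratio, sigma=0.25, seed=None):
    """Return sorted indices of retained samples.

    Samples are drawn without replacement within each class, so every class
    keeps roughly the same fraction (1 - prune_ratio) of its samples.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Hypothetical ratio-aware bias: the preferred score shifts as the
    # pruning ratio changes (assumption, not taken from the paper).
    target = 1.0 - prune_ratio
    kept = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n_keep = max(1, int(round(len(idx) * (1.0 - prune_ratio))))
        # Gaussian weights around the target score; the small constant keeps
        # every probability strictly positive.
        w = np.exp(-0.5 * ((scores[idx] - target) / sigma) ** 2) + 1e-12
        p = w / w.sum()
        kept.append(rng.choice(idx, size=n_keep, replace=False, p=p))
    return np.sort(np.concatenate(kept))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 100
    labels = np.repeat([0, 1], n // 2)
    scores = combined_score(rng.random(n), rng.random(n), rng.random(n))
    kept = prune_by_importance_sampling(scores, labels, prune_ratio=0.8, seed=0)
    print(len(kept), "samples retained out of", n)
```

Because retention is sampled rather than thresholded, samples across the whole score range keep a nonzero chance of being retained, and the same sampler can be driven by any other per-sample pruning score passed in as `scores`.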
