Data Pruning via Separability, Integrity, and Model Uncertainty-Aware Importance Sampling

Steven Grosz; Rui Zhao; Rajeev Ranjan; Hongcheng Wang; Manoj Aggarwal; Gerard Medioni; Anil Jain

TL;DR

This work addresses scalable, robust data pruning for image classification, particularly under high pruning ratios and class imbalance. It introduces SIM, a metric that jointly quantifies data separability, data integrity, and model uncertainty, and SIMS, an adaptive importance-sampling framework that biases sample retention according to the pruning ratio and the class distribution. SIM combines per-sample, multi-model estimates into a single score built from normalized separability, integrity, and uncertainty terms, so that pruning preserves informative, high-quality samples across classes. SIMS adds a ratio-aware sampling distribution with a hybrid class-dependent/class-independent strategy. On CIFAR-10/100, Tiny-ImageNet, and iNaturalist, SIMS outperforms baselines, especially at high pruning ratios, and generalizes across architectures while reducing the time needed to compute the pruning metric.

Abstract

This paper improves upon existing data pruning methods for image classification by introducing a novel pruning metric and a pruning procedure based on importance sampling. The proposed pruning metric explicitly accounts for data separability, data integrity, and model uncertainty, while the sampling procedure adapts to the pruning ratio and considers both intra-class and inter-class separation to further enhance the effectiveness of pruning. Furthermore, the sampling method can readily be applied to other pruning metrics to improve their performance. Overall, the proposed approach scales well to high pruning ratios and generalizes better across different classification models, as demonstrated by experiments on four benchmark datasets, including a fine-grained classification scenario.
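To make the idea of ratio-adaptive importance sampling concrete, the following is a minimal sketch and not the paper's formulation. It assumes a generic per-sample difficulty score in [0, 1] as a stand-in for any pruning metric, a Gaussian sampling weight whose mean moves linearly with the pruning ratio, and an equal per-class quota for the class-dependent part. The function name `ratio_aware_prune` and the parameters `class_frac` and `sigma` are hypothetical.

```python
import numpy as np


def ratio_aware_prune(scores, labels, alpha, class_frac=0.5, sigma=0.2, seed=0):
    """Return sorted indices of the samples to keep (illustrative sketch only).

    scores     : per-sample difficulty in [0, 1], higher = harder (stand-in metric).
    labels     : integer class label per sample.
    alpha      : pruning ratio, i.e. the fraction of samples removed.
    class_frac : share of the budget drawn per class (hypothetical knob).
    sigma      : width of the Gaussian sampling weight (hypothetical knob).
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    n = scores.size
    budget = int(round((1.0 - alpha) * n))

    # Assumed schedule: favor harder samples at low alpha, easier ones at high alpha.
    mu = 1.0 - alpha
    weights = np.exp(-0.5 * ((scores - mu) / sigma) ** 2) + 1e-12

    keep = np.zeros(n, dtype=bool)

    # Class-dependent part: an equal quota per class protects rare classes.
    classes = np.unique(labels)
    per_class = int(class_frac * budget) // len(classes)
    for c in classes:
        idx = np.flatnonzero(labels == c)
        k = min(per_class, idx.size)
        if k > 0:
            p = weights[idx] / weights[idx].sum()
            keep[rng.choice(idx, size=k, replace=False, p=p)] = True

    # Class-independent part: fill the remaining budget from the global pool.
    remaining = budget - int(keep.sum())
    pool = np.flatnonzero(~keep)
    if remaining > 0:
        p = weights[pool] / weights[pool].sum()
        keep[rng.choice(pool, size=remaining, replace=False, p=p)] = True

    return np.flatnonzero(keep)
```

Because the weights depend only on a score array, the same routine could wrap any pruning metric, which mirrors the abstract's remark that the sampling method can be applied to other metrics. The linear schedule for `mu` is a placeholder, and the schedule used in the paper may differ.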

Paper Structure (20 sections, 8 equations, 6 figures, 4 tables)

Introduction
Related Work
Data Pruning
Data Distillation
Methods
Problem Statement
Data Separability
Data Integrity
Model Uncertainty
Derivation of SIM
SIM with Importance Sampling (SIMS)
Importance Sampling based on Pruning Ratio
Combination of Class-Dependent and Class-Independent Sampling
Experimental Results
Datasets
...and 5 more sections

Figures (6)

Figure 1: Scatter plot of SIM scores for the CIFAR-100 dataset. The example images shown are from the "sweet pepper" class and are provided to give a visualization of the typical samples falling in those respective regions of the graph.
Figure 2: Illustration of the sampling distribution mean at different pruning ratios $\alpha$.
Figure 3: Illustration of varying sampling distributions for the Tiny-ImageNet dataset as the pruning ratio $\alpha$ increases.
Figure 4: Classification accuracy vs. training time of different pruning methods (Best viewed in color).
Figure 5: t-SNE visualization for CIFAR-10 comparing SIM to SIMS.
...and 1 more figures
