UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective

Furui Xu, Shaobo Wang, Jiajun Zhang, Chenghao Sun, Haixiang Tang, Linfeng Zhang

TL;DR

UNSEEN reframes dataset pruning from a generalization perspective to overcome the dense, hard-to-distinguish sample scores that arise when scoring models are trained on the full dataset. By scoring each sample with cross-validated models that did not see it during training, and by building the coreset incrementally over multiple stages, UNSEEN stabilizes sample-importance estimates and makes selection more discriminative. The framework is plug-and-play: it improves existing pruning methods and supports a multi-step evaluate-and-refill process that dynamically optimizes coreset quality. Experiments on CIFAR-10, CIFAR-100, and ImageNet-1K show substantial gains over state-of-the-art methods, including lossless pruning on ImageNet-1K with 30% of the training data removed. The authors also find that prioritizing hard-class samples reduces inter-class disparity and improves overall generalization.

Abstract

The growing scale of datasets in deep learning has introduced significant computational challenges. Dataset pruning addresses this challenge by constructing a compact but informative coreset from the full dataset with comparable performance. Previous approaches typically establish scoring metrics based on specific criteria to identify representative samples. However, these methods predominantly rely on sample scores obtained from the model's performance during the training (i.e., fitting) phase. As scoring models achieve near-optimal performance on training data, such fitting-centric approaches induce a dense distribution of sample scores within a narrow numerical range. This concentration reduces the distinction between samples and hinders effective selection. To address this challenge, we conduct dataset pruning from the perspective of generalization, i.e., scoring samples based on models not exposed to them during training. We propose a plug-and-play framework, UNSEEN, which can be integrated into existing dataset pruning methods. Additionally, conventional score-based methods are single-step and rely on models trained solely on the complete dataset, providing limited perspective on the importance of samples. To address this limitation, we scale UNSEEN to multi-step scenarios and propose an incremental selection technique through scoring models trained on varying coresets, and optimize the quality of the coreset dynamically. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art (SOTA) methods on CIFAR-10, CIFAR-100, and ImageNet-1K. Notably, on ImageNet-1K, UNSEEN achieves lossless performance while reducing training data by 30%.
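
The idea of scoring samples with models that never saw them can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid "scoring model", the function names, and averaging the scores across models are all assumptions made for illustration, and Entropy is used only as one example of a pluggable scoring metric. Each model is trained on one subset and scores the samples in the remaining subsets, so every score comes from a model that was not exposed to the sample.

```python
import numpy as np

def fit_centroids(X, y, num_classes):
    """Toy stand-in for a scoring model: one mean vector per class.
    Assumes every class appears in the training subset."""
    return np.stack([X[y == c].mean(axis=0) for c in range(num_classes)])

def predict_proba(centroids, X):
    """Softmax over negative squared distances to each class centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def entropy(p, eps=1e-12):
    """Predictive entropy per sample; higher means more uncertain."""
    return -(p * np.log(p + eps)).sum(axis=1)

def unseen_scores(X, y, num_classes, K=5, seed=0):
    """Score every sample only with models that did not train on it.

    The data is split at random into K subsets. A model is trained on
    each subset and scores the samples in the other K - 1 subsets.
    Averaging the K - 1 scores per sample is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)
    score_sum = np.zeros(len(X))
    score_cnt = np.zeros(len(X))
    for k in range(K):
        train_idx = folds[k]
        held_out = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit_centroids(X[train_idx], y[train_idx], num_classes)
        score_sum[held_out] += entropy(predict_proba(model, X[held_out]))
        score_cnt[held_out] += 1
    return score_sum / score_cnt

# Toy data: three well-separated Gaussian classes (purely illustrative).
rng = np.random.default_rng(1)
y = np.repeat(np.arange(3), 100)
X = rng.normal(size=(300, 4)) + 3.0 * y[:, None]
scores = unseen_scores(X, y, num_classes=3, K=5)
print(scores.shape)
```

Replacing `entropy` with another per-sample metric leaves the fold logic unchanged, which is the sense in which such a scheme could be plug-and-play.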

Paper Structure

This paper contains 13 sections, 1 equation, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: (a) Distribution of Entropy score on CIFAR-10 and CIFAR-100 under the fitting and UNSEEN frameworks. Under the fitting framework, Entropy scores exhibit dense clustering. Conversely, UNSEEN achieves uniform score dispersion, substantially improving discriminative separability. (b) Distribution of the rank assigned to each sample in the overall score ranking in two identical CIFAR-100 trials with different random seeds. Sample ranks fluctuate significantly under fitting but remain stable under UNSEEN. The Pearson correlation coefficient (PCC) between trials is 0.92 for UNSEEN, much higher than 0.43 under fitting.
  • Figure 2: Samples exhibit varying levels of importance across coresets of different stages. Incremental selection prioritizes samples with the highest importance at each stage, offering a more principled and adaptive approach to coreset construction.
  • Figure 3: The pipeline of UNSEEN. First, the training dataset is randomly partitioned into $K$ equal-sized subsets. Then, for each subset, a scoring model is trained and used to assign scores to the samples in the complementary subsets. The top $M_1$ samples with the highest scores are selected to form the initial coreset $S_1$. Next, a scoring model is trained on the selected samples and used to score the remaining unselected samples. Samples with the highest scores are incrementally added to the coreset. This procedure is repeated until the desired number of samples has been selected.
  • Figure 4: Plug-and-play enhancement of UNSEEN on CIFAR-10 (left) and CIFAR-100 (right). Margin and Least Confidence achieve significant enhancement with UNSEEN and outperform TDDS at low pruning rates.
  • Figure 5: Comprehensive comparison on fine-grained datasets, demonstrating UNSEEN’s superior performance.
  • ...and 4 more figures
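
The incremental selection loop described in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the paper's algorithm: `incremental_select`, `score_fn`, `budget`, and `step` are hypothetical names, and a fixed number of samples added per stage is an assumption, since the caption does not specify the stage sizes beyond $M_1$. The callback `score_fn(selected_idx, candidate_idx)` stands in for "train a scoring model on the current coreset and score the unselected samples".

```python
import numpy as np

def incremental_select(initial_scores, score_fn, budget, m1, step):
    """Build a coreset in stages.

    initial_scores: one score per sample, used to pick the first m1 samples.
    score_fn:       callable (selected_idx, candidate_idx) -> scores for the
                    candidates, given the current coreset (hypothetical hook).
    budget:         total number of samples to select.
    step:           samples added per later stage (an assumption here).
    """
    n = len(initial_scores)
    order = np.argsort(-np.asarray(initial_scores), kind="stable")
    selected = list(order[:min(m1, budget)])
    remaining = np.setdiff1d(np.arange(n), selected)
    while len(selected) < budget and len(remaining) > 0:
        # Re-score the unselected samples relative to the current coreset.
        scores = np.asarray(score_fn(np.array(selected), remaining))
        take = min(step, budget - len(selected))
        top = remaining[np.argsort(-scores, kind="stable")[:take]]
        selected.extend(top.tolist())
        remaining = np.setdiff1d(remaining, top)
    return np.array(selected)

# Toy usage: the hypothetical score is the distance of each candidate
# to the mean of the current coreset, so far-away samples are added first.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

def toy_score(selected_idx, candidate_idx):
    center = X[selected_idx].mean(axis=0)
    return np.linalg.norm(X[candidate_idx] - center, axis=1)

coreset = incremental_select(rng.random(50), toy_score, budget=20, m1=5, step=5)
print(len(coreset))
```

Because scores are recomputed at every stage, a sample that ranked low against the initial coreset can rank high later, which matches the observation in Figure 2 that importance varies across stages.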