UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective

Furui Xu; Shaobo Wang; Jiajun Zhang; Chenghao Sun; Haixiang Tang; Linfeng Zhang

TL;DR

UNSEEN reframes dataset pruning from a generalization perspective to address the dense, hard-to-distinguish sample scores that arise when scoring models are trained on the full dataset. By scoring samples with cross-validated models across folds and building the coreset incrementally over multiple stages, UNSEEN stabilizes sample-importance estimates and makes selection more discriminative. The framework is plug-and-play: it improves existing pruning methods and supports a multi-step evaluate-and-refill process that refines coreset quality dynamically. Experiments on CIFAR-10, CIFAR-100, and ImageNet-1K show substantial gains over state-of-the-art methods, including lossless performance on ImageNet-1K with 30% of the training data removed. The results also suggest that prioritizing samples from hard classes reduces inter-class disparity and improves overall generalization.

Abstract

The growing scale of datasets in deep learning has introduced significant computational challenges. Dataset pruning addresses this challenge by constructing a compact but informative coreset from the full dataset with comparable performance. Previous approaches typically establish scoring metrics based on specific criteria to identify representative samples. However, these methods predominantly rely on sample scores obtained from the model's performance during the training (i.e., fitting) phase. As scoring models achieve near-optimal performance on training data, such fitting-centric approaches induce a dense distribution of sample scores within a narrow numerical range. This concentration reduces the distinction between samples and hinders effective selection. To address this challenge, we conduct dataset pruning from the perspective of generalization, i.e., scoring samples based on models not exposed to them during training. We propose a plug-and-play framework, UNSEEN, which can be integrated into existing dataset pruning methods. Additionally, conventional score-based methods are single-step and rely on models trained solely on the complete dataset, providing a limited perspective on the importance of samples. To address this limitation, we scale UNSEEN to multi-step scenarios and propose an incremental selection technique that scores samples with models trained on varying coresets and dynamically optimizes the quality of the coreset. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art (SOTA) methods on CIFAR-10, CIFAR-100, and ImageNet-1K. Notably, on ImageNet-1K, UNSEEN achieves lossless performance while reducing training data by 30%.
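The core idea of scoring each sample with a model that never saw it during training can be illustrated with a small K-fold sketch. This is only an illustration under stated assumptions, not the paper's implementation: the nearest-centroid "scoring model", the `difficulty` score (one minus the predicted probability of the true class), the function names, and the rule of keeping the hardest 70% are all hypothetical stand-ins for whatever scoring metric and selection rule an existing pruning method would supply.

```python
import numpy as np

def fit_centroids(X, y, num_classes):
    # Hypothetical stand-in for a scoring model: one mean vector per class.
    return np.stack([X[y == c].mean(axis=0) for c in range(num_classes)])

def difficulty(centroids, X, y):
    # Softmax over negative squared distances to each class centroid;
    # the score is 1 - p(true class), so higher means harder.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return 1.0 - probs[np.arange(len(y)), y]

def held_out_scores(X, y, num_classes, num_folds=5, seed=0):
    # Each sample is scored by a model fitted on the other folds only,
    # so no score comes from a model that was trained on that sample.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), num_folds)
    scores = np.empty(len(y))
    for k, held_out in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        model = fit_centroids(X[train], y[train], num_classes)
        scores[held_out] = difficulty(model, X[held_out], y[held_out])
    return scores

# Toy data: two overlapping Gaussian blobs, 100 samples per class.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.repeat([0, 1], 100)

scores = held_out_scores(X, y, num_classes=2)
# Assumed selection rule for illustration: keep the 70% highest-scoring samples.
keep = np.argsort(scores)[-(len(y) * 7 // 10):]
```

Because every score is computed on held-out data, the scoring model cannot have memorized the sample being scored, which is the property the abstract contrasts with fitting-centric scoring.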
