DRUPI: Dataset Reduction Using Privileged Information
Shaobo Wang, Yantai Yang, Shuaiyu Zhang, Chenghao Sun, Weiya Li, Xuming Hu, Linfeng Zhang
TL;DR
DRUPI extends dataset reduction (DR) by introducing privileged information (PI), in the form of feature labels and attention labels, that accompanies the reduced data. It develops a structured pipeline (determine, synthesize, and learn with PI) with task-oriented supervision and versatility across multiple feature-label sets, and provides a VC-theory justification for improved learning in low-data regimes. Empirically, DRUPI enhances both coreset selection and dataset distillation across CIFAR-10/100, Tiny ImageNet, and ImageNet subsets, with notable cross-architecture gains. The work demonstrates that balancing discriminability and diversity in privileged signals is crucial for maximizing the quality of compressed datasets, and it opens a path toward richer DR pipelines.
Abstract
Dataset reduction (DR) seeks to select or distill samples from large datasets into smaller subsets while preserving performance on target tasks. Existing methods primarily focus on pruning or synthesizing data in the same format as the original dataset, typically input data and the corresponding labels. However, in DR settings, we find that it is possible to synthesize information beyond the data-label pair as an additional learning target to facilitate model training. In this paper, we introduce Dataset Reduction Using Privileged Information (DRUPI), which enriches DR by synthesizing privileged information alongside the reduced dataset. This privileged information can take the form of feature labels or attention labels, providing auxiliary supervision to improve model learning. Our findings reveal that effective feature labels must be neither overly discriminative nor excessively diverse, with a moderate level of each proving optimal for improving the reduced dataset's efficacy. Extensive experiments on ImageNet, CIFAR-10/100, and Tiny ImageNet demonstrate that DRUPI integrates seamlessly with existing dataset reduction methods, offering significant performance gains. *The code will be released after the paper is accepted.*
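
To make the idea of auxiliary supervision concrete, the following is a minimal sketch of how a stored feature label could enter a training objective: a standard classification loss plus a weighted regression term that pulls the model's intermediate features toward the feature label. The function names, the use of mean squared error, and the `weight` parameter are illustrative assumptions, not the objective defined in the paper.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample; `target` is a class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def feature_mse(features, feature_label):
    """Mean squared error between model features and a stored feature label."""
    if len(features) != len(feature_label):
        raise ValueError("feature and feature-label sizes differ")
    squared = [(f - t) ** 2 for f, t in zip(features, feature_label)]
    return sum(squared) / len(squared)


def loss_with_privileged_info(logits, target, features, feature_label, weight=1.0):
    """Task loss plus a weighted auxiliary term.

    Assumption: the auxiliary term is an MSE on features scaled by `weight`;
    the source does not specify this form.
    """
    task_loss = cross_entropy(logits, target)
    aux_loss = feature_mse(features, feature_label)
    return task_loss + weight * aux_loss
```

With `weight=0.0` this reduces to ordinary training on the data-label pair, so the auxiliary term can be read as an optional addition on top of an existing reduction method.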
