DRUPI: Dataset Reduction Using Privileged Information

Shaobo Wang, Yantai Yang, Shuaiyu Zhang, Chenghao Sun, Weiya Li, Xuming Hu, Linfeng Zhang

TL;DR

DRUPI extends dataset reduction by introducing privileged information in the form of feature labels and attention labels that accompany the reduced data. It develops a structured pipeline (determine, synthesize, and learn with PI) with task-oriented supervision and versatility across multiple feature-label sets, and provides a VC-theory justification for improved learning in low-data regimes. Empirically, DRUPI enhances both coreset selection and dataset distillation across CIFAR-10/100, Tiny ImageNet, and ImageNet subsets, with notable cross-architecture gains. The work demonstrates that balancing discriminability and diversity in privileged signals is crucial for maximizing the quality of compressed datasets and opens a path toward richer DR pipelines.

Abstract

Dataset reduction (DR) seeks to select or distill samples from large datasets into smaller subsets while preserving performance on target tasks. Existing methods primarily focus on pruning or synthesizing data in the same format as the original dataset, typically the input data and corresponding labels. However, in DR settings, we find it is possible to synthesize information beyond the data-label pair as an additional learning target to facilitate model training. In this paper, we introduce Dataset Reduction Using Privileged Information (DRUPI), which enriches DR by synthesizing privileged information alongside the reduced dataset. This privileged information can take the form of feature labels or attention labels, providing auxiliary supervision to improve model learning. Our findings reveal that effective feature labels must be neither overly discriminative nor excessively diverse, with a moderate balance between the two proving optimal for improving the reduced dataset's efficacy. Extensive experiments on ImageNet, CIFAR-10/100, and Tiny ImageNet demonstrate that DRUPI integrates seamlessly with existing dataset reduction methods, offering significant performance gains. *The code will be released after the paper is accepted.*
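
To make the idea of auxiliary supervision concrete, the following is a minimal sketch of how a model trained on the reduced set might combine the usual classification loss with a term that pulls its intermediate features toward the synthesized feature labels. The function names, the mean-squared-error form of the auxiliary term, and the weight `lam` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy between logits of shape (N, C) and integer labels (N,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def loss_with_feature_labels(logits, features, labels, feature_labels, lam=1.0):
    """Task loss plus an auxiliary term matching features to feature labels.

    Assumption: the auxiliary term is a mean squared error weighted by `lam`;
    both choices are hypothetical and only illustrate the general idea.
    """
    task = softmax_cross_entropy(logits, labels)
    aux = np.mean((features - feature_labels) ** 2)
    return task + lam * aux
```

With `lam=0` the sketch reduces to ordinary training on data-label pairs, which is the baseline that privileged information is meant to improve on.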

Paper Structure (26 sections, 12 equations, 5 figures, 14 tables, 1 algorithm)

This paper contains 26 sections, 12 equations, 5 figures, 14 tables, 1 algorithm.

Figures (5)

  • Figure 1: A comparison between conventional dataset reduction pipelines and our proposed DRUPI framework. (a) Previous dataset reduction methods distill or select a subset $\mathcal{D}_{\mathcal{S}}$ from the original dataset $\mathcal{D}_{\mathcal{T}}$, maintaining the original "data-label" structure. (b) In contrast, DRUPI synthesizes auxiliary privileged information from $\mathcal{D}_{\mathcal{T}}$, providing additional supervision to models trained on the reduced subset $\mathcal{D}_{\mathcal{S}}$. (c) Cosine similarity between the gradients of a pre-trained model on the real dataset and on synthetic datasets with and without privileged information (feature labels). Synthetic datasets are generated using DC with 10 images per class (IPC). We used the same pre-trained ConvNet for gradient extraction.
  • Figure 2: Comparison between (a) the traditional "data-label" structure and (b) different forms of privileged information. The non-target classes of soft labels provide additional information and can be considered a form of privileged information. Feature labels encapsulate high-dimensional information. Attention labels are obtained by applying average pooling to feature labels.
  • Figure 3: Feature labels learned under varying levels of task supervision. (a) t-SNE visualization of feature labels learned with different task supervision coefficients $\lambda_{task}$. (b) The most effective feature labels are produced with a moderate level of task supervision, avoiding excessively high or low supervision. (c) Increasing task supervision makes the feature labels more discriminative but less diverse. Diversity is measured by the negative mutual information between the feature labels and the ground truth labels, while discriminability is measured by the classification accuracy of a linear classifier trained on the feature labels.
  • Figure 4: (a) Comparison of different methods for obtaining feature labels in datasets initialized with various distillation methods. Our results indicate that learning-based methods yield the best performance. (b) Impact of feature label versatility and the utilization of multiple feature labels. We find that incorporating more feature labels produces a more robust reduced dataset, with averaging the features outperforming random selection. (c) Evaluation of supervision using different layers from a depth-3 ConvNet for synthesizing feature labels. Results show that, across different IPCs and datasets, using the final layer features for supervision generates the most effective reduced dataset.
  • Figure 5: Comparison of noise initialization (yellow) and initialization with assigned features (blue) from a pre-trained ConvNet on CIFAR-10 and CIFAR-100 across different IPC settings.
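
Two operations mentioned in the captions above can be sketched in a few lines: deriving an attention label from a feature label by average pooling (Figure 2), and comparing gradients by cosine similarity (Figure 1c). This is a hedged illustration, not the authors' code. The function names are hypothetical, pooling over the channel axis of a `(C, H, W)` feature label is an assumption (the caption only says "average pooling"), and flattening all per-layer gradients into one vector before comparison is likewise an assumption.

```python
import numpy as np


def attention_label(feature_label):
    """Average a (C, H, W) feature label over channels to get an (H, W) map.

    Assumption: pooling is done over the channel axis; the source does not
    specify the pooling axis or window.
    """
    return feature_label.mean(axis=0)


def gradient_cosine_similarity(grads_a, grads_b):
    """Cosine similarity between two lists of per-layer gradient arrays.

    Assumption: all layers are flattened and concatenated into one vector.
    """
    a = np.concatenate([g.ravel() for g in grads_a])
    b = np.concatenate([g.ravel() for g in grads_b])
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```

A value near 1 from `gradient_cosine_similarity` would indicate that the synthetic data pushes the model in nearly the same direction as the real data, which is the kind of comparison Figure 1(c) reports.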