Towards Trustworthy Dataset Distillation

Shijie Ma, Fei Zhu, Zhen Cheng, Xu-Yao Zhang

TL;DR

TrustDD addresses the dual challenges of efficiency and trustworthiness in dataset distillation by distilling both in-distribution data and outliers, and introducing pseudo-outlier exposure to avoid dependence on real outlier data. The method extends gradient/trajectory-based DD with an integrated OOD detection objective and a corruption-based POE pipeline, enabling models trained on a tiny distilled set to perform well on both InD classification and OOD detection. Empirical results across CIFAR, ImageNet subsets, and digit datasets show substantial OOD improvements with POE often surpassing OE, while InD accuracy remains competitive. This approach demonstrates that distilling targeted outlier information alongside InD data yields robust open-world performance and provides a practical baseline for trustworthy DD.

Abstract

Efficiency and trustworthiness are two enduring pursuits when applying deep learning in real-world applications. With regard to efficiency, dataset distillation (DD) aims to reduce training costs by distilling a large dataset into a tiny synthetic one. However, existing methods concentrate solely on in-distribution (InD) classification in a closed-world setting and disregard out-of-distribution (OOD) samples. OOD detection, on the other hand, aims to enhance models' trustworthiness, but it is typically achieved inefficiently in full-data settings. For the first time, we consider both issues simultaneously and propose a novel paradigm called Trustworthy Dataset Distillation (TrustDD). By distilling both InD samples and outliers, the condensed datasets can train models that are competent in both InD classification and OOD detection. To alleviate the requirement for real outlier data, we further propose corrupting InD samples to generate pseudo-outliers, a technique we call Pseudo-Outlier Exposure (POE). Comprehensive experiments in various settings demonstrate the effectiveness of TrustDD, and POE surpasses the state-of-the-art method Outlier Exposure (OE). Compared with preceding DD methods, TrustDD is more trustworthy and more applicable to open-world scenarios. Our code is available at https://github.com/mashijie1028/TrustDD
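To make the idea of corrupting InD samples into pseudo-outliers concrete, the following is a minimal NumPy sketch of three of the corruption types named in the paper's figures (invert, jigsaw, speckle); mosaic is omitted for brevity. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 2x2 jigsaw grid, the noise scale `std`, and the uniform random choice among corruptions are all hypothetical. Images are assumed to be float arrays of shape (H, W, C) with values in [0, 1].

```python
import numpy as np


def invert(img):
    """Invert pixel intensities of a float image with values in [0, 1]."""
    return 1.0 - img


def jigsaw(img, grid=2, rng=None):
    """Cut an (H, W, C) image into grid x grid patches and shuffle them.

    Assumes H and W are divisible by `grid` (true for 32x32 CIFAR images).
    A random permutation can occasionally be the identity.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = img.shape
    ph, pw = h // grid, w // grid
    patches = [img[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(grid) for j in range(grid)]
    order = rng.permutation(len(patches))
    # Reassemble the shuffled patches row by row.
    rows = [np.concatenate([patches[k] for k in order[r * grid:(r + 1) * grid]], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0)


def speckle(img, std=0.5, rng=None):
    """Multiplicative Gaussian noise: x + x * n, clipped back to [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, std, size=img.shape)
    return np.clip(img + img * noise, 0.0, 1.0)


def make_pseudo_outliers(images, rng=None):
    """Apply one randomly chosen corruption to each image in an (N, H, W, C) batch."""
    rng = np.random.default_rng() if rng is None else rng
    ops = [invert,
           lambda x: jigsaw(x, rng=rng),
           lambda x: speckle(x, rng=rng)]
    return np.stack([ops[rng.integers(len(ops))](img) for img in images])
```

In a pipeline of this kind, the pseudo-outlier batch would stand in for the real outlier data that OE requires; how the corrupted samples are then distilled is specific to TrustDD and is not shown here.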

Paper Structure (62 sections, 12 equations, 20 figures, 14 tables, 1 algorithm)

This paper contains 62 sections, 12 equations, 20 figures, 14 tables, 1 algorithm.

Figures (20)

  • Figure 1: Advantages of the proposed TrustDD over preceding dataset distillation (Ordinary DD). For OOD test samples, models trained with Ordinary DD assign high confidence and misclassify bubbly samples from Texture (Cimpoi et al., 2014) as cats, while TrustDD is capable of training reliable models that reject OOD samples with low confidence.
  • Figure 2: Maximum Softmax Probability score distribution of InD and OOD samples. TrustDD trains better OOD detectors than ordinary DD (compare the Texture baseline and POE subfigures). Models are trained on CIFAR10 with 50 Images Per Class (IPC).
  • Figure 3: Visualization of InD corruption to synthesize pseudo-outliers on CIFAR10 (Krizhevsky, 2009). The corruption transformations are jigsaw, invert, mosaic, and speckle.
  • Figure 4: In-distribution accuracy ($\%$) of models trained on distilled data on CIFAR10 and CIFAR100.
  • Figure 5: The rationale of TrustDD. (a) OOD detection performance when distilling only InD data versus distilling both InD and OOD data to the same size $|\mathcal{S}|$. All InD: $|\mathcal{S}_\textrm{in}|=|\mathcal{S}|,|\mathcal{S}_\textrm{out}|=0$. InD+OE and InD+POE: $|\mathcal{S}_\textrm{in}|=|\mathcal{S}_\textrm{out}|=\frac{1}{2}|\mathcal{S}|$. (b) OOD detection performance of OE-R, OE-D, POE-R, and POE-D, where '-R' denotes 'randomly selected' outliers and '-D' denotes outliers 'distilled' via TrustDD.
  • ...and 15 more figures
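
The Maximum Softmax Probability (MSP) score referenced in Figures 1 and 2 can be sketched as follows. This is a minimal NumPy illustration rather than the paper's evaluation code: the function names and the rejection threshold of 0.5 are hypothetical, and `logits` is assumed to be an (N, num_classes) array of classifier outputs.

```python
import numpy as np


def softmax(logits):
    """Numerically stable row-wise softmax for an (N, num_classes) array."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def msp_score(logits):
    """Maximum Softmax Probability: the confidence of the predicted class."""
    return softmax(logits).max(axis=1)


def reject_as_ood(logits, threshold=0.5):
    """Flag samples whose MSP falls below a (hypothetical) threshold as OOD."""
    return msp_score(logits) < threshold
```

A well-behaved detector should give InD samples high MSP and OOD samples low MSP, so that the two score distributions separate; the threshold trades off false rejections of InD samples against missed outliers.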