On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm

Peng Sun, Bei Shi, Daiwei Yu, Tao Lin

TL;DR

RDED, a novel, computationally efficient yet effective data distillation paradigm, is proposed to enable both diversity and realism of the distilled data.

Abstract

Contemporary machine learning requires training large neural networks on massive datasets and thus faces the challenge of high computational demands. Dataset distillation, a recently emerging strategy, aims to compress real-world datasets for efficient training. However, this line of research currently struggles with large-scale and high-resolution datasets, hindering its practicality and feasibility. To this end, we re-examine the existing dataset distillation methods and identify three properties required for large-scale real-world applications, namely, realism, diversity, and efficiency. As a remedy, we propose RDED, a novel, computationally efficient yet effective data distillation paradigm, to enable both diversity and realism of the distilled data. Extensive empirical results over various neural architectures and datasets demonstrate the advancement of RDED: we can distill the full ImageNet-1K to a small dataset comprising 10 images per class within 7 minutes, achieving a notable 42% top-1 accuracy with ResNet-18 on a single RTX-4090 GPU (while the SOTA only achieves 21% but requires 6 hours).

Paper Structure

This paper contains 65 sections, 1 theorem, 21 equations, 7 figures, 9 tables, and 1 algorithm.

Key Result

Proposition 1

Given a distilled dataset $\mathcal{S} =(X,Y)$, we derive the following approximations to maximize the diversity term $H_{\mathcal{V}}(Y|\varnothing)$ and the realism term $-H_{\mathcal{V}}(Y|X)$:

Figures (7)

  • Figure 1: Proposed paradigm vs. optimization-based paradigm. Left: the mainstream optimization-based dataset distillation. Middle: our proposed non-optimizing paradigm. Right: top-1 validation accuracy vs. synthesis time per image on ImageNet-1K with $\texttt{IPC} = 10$ (10 Images Per Class). Models used for distillation include ResNet-18, EfficientNet-B0, and MobileNet-V2; we use ResNet-18 for evaluation.
  • Figure 2: Visualization of images synthesized using various dataset distillation methods. We consider the ImageNet-Fruits dataset (Cazenavette et al., 2022), which comprises 10 distinct fruit types at a resolution of $128 \times 128$. Four classes are shown for each method: 1) Pineapple, 2) Banana, 3) Pomegranate, and 4) Fig. MTT (Cazenavette et al., 2022), GLaD (Cazenavette et al., 2023), SRe$^2$L (Yin et al., 2023), and Herding (Welling, 2009) are four representative methods of the conventional dataset distillation paradigms discussed in the related-work and pitfalls sections (see the appendix on auxiliary visualization for more examples). In general, ensuring both superior realism and diversity simultaneously is challenging for methods other than ours and GLaD.
  • Figure 3: Visualization of our proposed two-stage dataset distillation framework. Stage 1: We crop each original image into several patches and rank them using the realism scores calculated by the observer model. We then choose the top-scored patch of each image as its key patch. Among the key patches within a class, we re-select the top $N \times \texttt{IPC}$ patches based on their scores, where $N = 4$ in this case. Stage 2: We consolidate every $N$ selected patches from Stage 1 into a single new image that has the same resolution as each original image, resulting in $\texttt{IPC}$ distilled images per class. These images are then relabeled using the pre-trained observer model.
  • Figure 4: Ablation study on $\left\lvert\mathcal{T}_c^\prime\right\rvert$ and $N$, i.e., the pre-selected subset size $\left\lvert\mathcal{T}_c^\prime\right\rvert$ (left), and the number of patches $N$ within each distilled image (right). The emerald $\bullet$, red $\bullet$, and blue $\bullet$ denote ImageNet-10, ImageNet-100, and ImageNet-1K, respectively.
  • Figure 5: Ablation study on $\left\lvert\mathcal{T}_c^\prime\right\rvert$ and $N$, i.e., the pre-selected subset size $\left\lvert\mathcal{T}_c^\prime\right\rvert$ (left), and the number of patches $N$ within each distilled image (right). The lemon $\bullet$, purple $\bullet$, and turquoise $\bullet$ denote CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively.
  • ...and 2 more figures
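
The two-stage procedure described in the Figure 3 caption can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation. Several details are assumptions that the caption does not state: the realism score is taken to be the observer's log-probability of the true class, crops are random squares sized so that $N$ of them tile the original resolution without resizing, and the names `distill_class`, `realism_score`, and `toy_observer` are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def realism_score(observer, patch, label):
    """Assumed score: log-probability the observer assigns to the true label."""
    logits = observer(patch)
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(log_probs[label])

def random_crops(image, num_crops, patch_size, rng):
    """Crop `num_crops` random square patches from an H x W x C image."""
    h, w = image.shape[:2]
    crops = []
    for _ in range(num_crops):
        top = int(rng.integers(0, h - patch_size + 1))
        left = int(rng.integers(0, w - patch_size + 1))
        crops.append(image[top:top + patch_size, left:left + patch_size])
    return crops

def distill_class(images, label, observer, ipc, n=4, num_crops=5, seed=0):
    """Distill one class into `ipc` images, each tiled from `n` patches.

    Assumes square images whose side is divisible by sqrt(n), and that
    at least n * ipc images are available for the class.
    """
    grid = int(round(np.sqrt(n)))
    assert grid * grid == n, "n must be a perfect square in this sketch"
    assert len(images) >= n * ipc, "need at least n * ipc images"
    rng = np.random.default_rng(seed)
    patch_size = images[0].shape[0] // grid

    # Stage 1a: keep the best-scoring crop of each image as its key patch.
    key_patches = []
    for img in images:
        crops = random_crops(img, num_crops, patch_size, rng)
        scores = [realism_score(observer, c, label) for c in crops]
        best = int(np.argmax(scores))
        key_patches.append((scores[best], crops[best]))

    # Stage 1b: re-select the top n * ipc key patches within the class.
    key_patches.sort(key=lambda pair: pair[0], reverse=True)
    selected = [patch for _, patch in key_patches[: n * ipc]]

    # Stage 2: tile every n patches into one image, then relabel it.
    distilled, soft_labels = [], []
    for i in range(ipc):
        group = selected[i * n:(i + 1) * n]
        rows = [np.concatenate(group[r * grid:(r + 1) * grid], axis=1)
                for r in range(grid)]
        canvas = np.concatenate(rows, axis=0)
        distilled.append(canvas)
        soft_labels.append(softmax(observer(canvas)))
    return distilled, soft_labels

def toy_observer(x):
    """Hypothetical stand-in for a pre-trained model: two-class logits."""
    m = float(x.mean())
    return np.array([m, 1.0 - m])
```

With ten random $8 \times 8 \times 3$ arrays as inputs, `distill_class(images, 0, toy_observer, ipc=2)` returns two $8 \times 8 \times 3$ images and two soft-label vectors. In practice the observer would be a pre-trained network, and patches would typically be resized rather than cropped to an exact tile size.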

Theorems & Definitions (7)

  • Definition 1: Properties of distilled data
  • Proposition 1: Proxies on the diversity and realism of distilled data
  • Definition 2: Predictive Family
  • Definition 3
  • Definition 4
  • Proof: Maximizing diversity of distilled data
  • Proof: Maximizing realism of distilled data