Generative Dataset Distillation Based on Diffusion Model
Duo Su, Junjie Hou, Guang Li, Ren Togo, Rui Song, Takahiro Ogawa, Miki Haseyama
TL;DR
The paper addresses data-efficient learning under a strict generation-time limit by proposing a generative dataset distillation (DD) framework built on the SDXL-Turbo diffusion model. It combines text prompts derived from class labels, one-step Text2Image sampling, and post data augmentation (PDA), guided by adversarial and distillation losses, to produce high-fidelity synthetic data at a large number of images per class (IPC). The method reaches IPC = 10 on Tiny-ImageNet and IPC = 20 on CIFAR-100, and it placed third in the ECCV 2024 DD Challenge, with PDA further improving performance. The work shows that fast, diffusion-based generative DD can yield scalable, label-aware distilled datasets suitable for rapid training, and it identifies distribution discrepancy as a key challenge when distilling smaller datasets.
Abstract
This paper presents our method for the generative track of The First Dataset Distillation Challenge at ECCV 2024. Because diffusion models have become the mainstay of generative modeling owing to their high-quality outputs, we focus on distillation methods based on diffusion models. The track allows only 10 minutes to generate a fixed number of images for the CIFAR-100 and Tiny-ImageNet datasets, so the generative model must be able to produce images at high speed. In this study, we propose a novel generative dataset distillation method based on Stable Diffusion. Specifically, we use the SDXL-Turbo model, which generates images at high speed and with high quality. Whereas other diffusion models can only reach images per class (IPC) = 1, our method achieves IPC = 10 for Tiny-ImageNet and IPC = 20 for CIFAR-100. Additionally, to generate high-quality distilled datasets for CIFAR-100 and Tiny-ImageNet, we use class information as text prompts for the SDXL-Turbo model and apply post data augmentation. Experimental results show the effectiveness of the proposed method, and we achieved third place in the generative track of the ECCV 2024 DD Challenge. Code is available at https://github.com/Guang000/BANKO.
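To make the time constraint concrete, the following minimal sketch works out how many images each setting requires and how much time is available per image on average. It is an illustration, not the authors' code. The class counts (100 for CIFAR-100, 200 for Tiny-ImageNet) are properties of the datasets and the IPC values come from the abstract. The assumption that the 10-minute limit applies to each dataset separately, the prompt template "a photo of a {}", and the helper names `build_prompts` and `per_image_budget` are all hypothetical.

```python
from typing import List

# Class counts are dataset properties; IPC values are those reported in the abstract.
SETTINGS = {
    "CIFAR-100": {"num_classes": 100, "ipc": 20},
    "Tiny-ImageNet": {"num_classes": 200, "ipc": 10},
}

# Assumption: the 10-minute limit applies to each dataset separately.
TIME_LIMIT_S = 10 * 60


def build_prompts(class_names: List[str], ipc: int,
                  template: str = "a photo of a {}") -> List[str]:
    """Return one text prompt per image to generate (ipc prompts per class).

    The template is a hypothetical choice; the abstract only states that
    class information is used as the text prompt.
    """
    return [template.format(name) for name in class_names for _ in range(ipc)]


def per_image_budget(num_classes: int, ipc: int,
                     time_limit_s: float = TIME_LIMIT_S) -> float:
    """Average number of seconds available to generate one image."""
    return time_limit_s / (num_classes * ipc)


if __name__ == "__main__":
    for name, cfg in SETTINGS.items():
        total = cfg["num_classes"] * cfg["ipc"]
        budget = per_image_budget(cfg["num_classes"], cfg["ipc"])
        print(f"{name}: {total} images, {budget:.2f} s per image on average")
```

Under these assumptions both settings amount to 2,000 images, or about 0.3 seconds per image on average, which is why a one-step sampler is attractive here. Prompts for the same class are identical in this sketch, so any variety among images of one class would have to come from the sampler's random noise.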
