One Category One Prompt: Dataset Distillation using Diffusion Models

Ali Abbasi, Ashkan Shahbazi, Hamed Pirsiavash, Soheil Kolouri

TL;DR

This paper introduces Dataset Distillation using Diffusion Models (D3M) as a novel paradigm for dataset distillation, leveraging recent advancements in generative text-to-image foundation models to create concise and informative representations for large datasets.

Abstract

The extensive amounts of data required for training deep neural networks pose significant challenges on storage and transmission fronts. Dataset distillation has emerged as a promising technique to condense the information of massive datasets into a much smaller yet representative set of synthetic samples. However, traditional dataset distillation approaches often struggle to scale effectively with high-resolution images and more complex architectures due to the limitations in bi-level optimization. Recently, several works have proposed exploiting knowledge distillation with decoupled optimization schemes to scale up dataset distillation. Although these methods effectively address the scalability issue, they rely on extensive image augmentations requiring the storage of soft labels for augmented images. In this paper, we introduce Dataset Distillation using Diffusion Models (D3M) as a novel paradigm for dataset distillation, leveraging recent advancements in generative text-to-image foundation models. Our approach utilizes textual inversion, a technique for fine-tuning text-to-image generative models, to create concise and informative representations for large datasets. By employing these learned text prompts, we can efficiently store and infer new samples for introducing data variability within a fixed memory budget. We show the effectiveness of our method through extensive experiments across various computer vision benchmark datasets with different memory budgets.
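The storage idea in the abstract can be illustrated with a toy sketch: if generation is deterministic given a learned prompt and a random seed, only the prompts and seeds need to be kept, and the samples can be regenerated on demand. Everything below is an assumption made for illustration. `toy_generator` is a hypothetical stand-in for a text-to-image diffusion model, and the two-dimensional "prompt embeddings" are placeholders, not the learned tokens from the paper.

```python
import random

def toy_generator(prompt_embedding, seed, size=4):
    """Hypothetical stand-in for a text-to-image model: deterministic in (prompt, seed)."""
    rng = random.Random(seed)
    # The prompt shifts the output; the seed-driven noise adds variability.
    return [sum(prompt_embedding) + rng.gauss(0.0, 1.0) for _ in range(size)]

def regenerate_dataset(prompts, seeds):
    """Rebuild (sample, label) pairs from the stored prompts and seeds only."""
    return [(toy_generator(embedding, seed), label)
            for label, embedding in prompts.items()
            for seed in seeds]

# One placeholder prompt per category, plus a shared list of seeds.
prompts = {"pirate": [0.1, 0.2], "restaurant": [0.3, -0.1]}
seeds = [0, 1, 2]

first = regenerate_dataset(prompts, seeds)
second = regenerate_dataset(prompts, seeds)
```

Calling `regenerate_dataset` twice yields identical samples, so the distilled set never has to be stored as images. Adding seeds increases variability at the cost of one stored integer each.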

Paper Structure (14 sections, 2 equations, 8 figures, 3 tables)


Figures (8)

  • Figure 1: Illustrating the core steps in our proposed framework, Dataset Distillation using Diffusion Models (D3M). Step 1 follows the work of sun2023diversity and utilizes a teacher network to identify important patches of the training data and create collages of these patches. Step 2 employs textual inversion gal2022image to optimize a single prompt per category, resulting in the creation of collage images through Stable Diffusion rombach2022high. Regarding labels, we consider two different settings, namely, one-hot and soft labels. To generate soft labels for synthetic images of each category, in Step 3, a random seed is fixed, and Stable Diffusion is utilized to generate collage images, which are then fed to the teacher network to obtain the soft labels. Finally, in Step 4, the categorical prompts and random seeds are employed to create the distilled dataset and train the student.
  • Figure 2: Following the work of Sun et al. sun2023diversity, we first identify a patch per input image that results in the lowest cross-entropy loss for a pre-trained and frozen teacher model. Then, we construct collage images of these important patches.
  • Figure 3: Given a text-to-image diffusion model, such as Latent Diffusion Model (LDM) rombach2022high, for each category of the training data, we employ 'textual inversion' gal2022image to optimize a token (i.e., a prompt), $v_*$, resulting in the generation of collage images that are similar to the ones constructed in Step 1 (Figure 2).
  • Figure 4: Comparison of collage images generated by textual inversion versus the engineered prompt, "A 4$\times$4 natural collage of 'name_of_class' images," for classes 'pirate' and 'restaurant' in ImageNet-1k dataset.
  • Figure 5: Generated images via $\Phi(\epsilon, \rho(v^c_*))$ for different values of $c$ and $\epsilon$.
  • ...and 3 more figures
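
The patch selection and collage construction described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_teacher` is a hypothetical frozen teacher, candidate patches are taken from a non-overlapping grid (the cropping scheme is not given in the captions), and images are plain 2D lists.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def crops(image, size):
    """Yield all non-overlapping size x size patches of a 2D-list image."""
    for r in range(0, len(image) - size + 1, size):
        for c in range(0, len(image[0]) - size + 1, size):
            yield [row[c:c + size] for row in image[r:r + size]]

def best_patch(image, label, teacher, size):
    """Return the patch with the lowest teacher cross-entropy for `label`."""
    return min(crops(image, size),
               key=lambda patch: cross_entropy(teacher(patch), label))

def make_collage(patches, grid):
    """Tile grid * grid equally sized patches (row-major) into one 2D image."""
    collage = []
    for g in range(grid):
        row_patches = patches[g * grid:(g + 1) * grid]
        for r in range(len(row_patches[0])):
            collage.append([v for patch in row_patches for v in patch[r]])
    return collage

def toy_teacher(patch):
    """Hypothetical frozen teacher: two logits derived from mean brightness."""
    mean = sum(map(sum, patch)) / (len(patch) * len(patch[0]))
    return [-mean, mean]

# A 4x4 toy image whose top-right 2x2 block is the most "class 1"-like region.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
patch = best_patch(image, label=1, teacher=toy_teacher, size=2)
collage = make_collage([patch] * 4, grid=2)
```

Here one patch is selected per image and four copies are tiled into a 2 by 2 collage for brevity. In practice each collage cell would hold the selected patch of a different training image from the same category.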