Random Conditioning with Distillation for Data-Efficient Diffusion Model Compression

Dohyun Kim, Sehwan Park, Geonhee Han, Seung Wook Kim, Paul Hongsuck Seo

TL;DR

The paper tackles data efficiency in distilling conditional diffusion models by introducing Random Conditioning, which pairs noised inputs with randomly chosen text prompts to expand the conditioning space without generating large image-text datasets. By combining losses on noise prediction and intermediate features, the method distills teacher diffusion models into smaller student models in an image-free setting, enabling generation of concepts not covered by the training images. Empirical results on LAION-derived and MS-COCO data show substantial gains in FID, IS, and CLIP scores over naïve baselines, performance comparable to the teacher models when random conditioning is combined with data augmentation, and generation of unseen concepts without real images. The approach yields data-efficient, resource-friendly diffusion model compression, performs strongly with both block- and channel-based architectures, and opens avenues for deploying diffusion models in data-constrained environments and across diverse modalities.

Abstract

Diffusion models generate high-quality images through progressive denoising but are computationally intensive due to large model sizes and repeated sampling. Knowledge distillation, which transfers knowledge from a complex teacher to a simpler student model, has been widely studied in recognition tasks, particularly for transferring concepts unseen during student training. However, its application to diffusion models remains underexplored, especially in enabling student models to generate concepts not covered by the training images. In this work, we propose Random Conditioning, a novel approach that pairs noised images with randomly selected text conditions to enable efficient, image-free knowledge distillation. By leveraging this technique, we show that the student can generate concepts unseen in the training images. When applied to conditional diffusion model distillation, our method allows the student to explore the condition space without generating condition-specific images, resulting in notable improvements in both generation quality and efficiency. This promotes resource-efficient deployment of generative diffusion models, broadening their accessibility for both research and real-world applications. Code, models, and datasets are available at https://dohyun-as.github.io/Random-Conditioning .

Paper Structure

This paper contains 24 sections, 3 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: Qualitative Comparison of Baseline and Our Method Trained Without Animal Image Data. We train models on a dataset excluding animal-related images, both without and with random conditioning. Each row represents (from top to bottom) the teacher model, the model trained without random conditioning, and the model trained with random conditioning. In (a), samples are generated conditioned on captions unrelated to animals, and in (b), samples are generated conditioned on captions related to animals. The captions used to generate these samples are provided in the Supp. Mat. for reference.
  • Figure 2: Generated MNIST Images of Distilled and Excluded Digits by Teacher and Student. When the student is distilled using a dataset containing only a subset of digits, it fails to generate the excluded digit ('3'). Images from both the teacher and student models are generated with the same random seed for comparison.
  • Figure 3: Effects of Altered Conditioning on Generated Results from an Input Image across Timesteps. For both MNIST and MSCOCO, each row shows results generated from the input image in the leftmost column, conditioned on the condition shown in the rightmost column. First, $\mathbf{x}_t$ is derived via the forward process from the initial image $\mathbf{x}_0$, which is associated with the image label, at the timestep $t$ shown above each image. Then, $\mathbf{x}_0$ is regenerated through the reverse process, conditioned on the condition displayed in the rightmost column.
  • Figure 4: Overview of the Random Conditioning Approach. When distilling knowledge from the teacher model to a smaller student model, instead of always pairing each noised training sample $\mathbf{x}_{t}^{n}$ with its original condition $c^n$, we replace that condition with a random condition $\tilde{c}$ from the text dataset, with a predefined probability $p(t)$ at each timestep $t$. This approach enables the student model to learn the teacher's behavior even for conditions without explicit image pairs.
  • Figure 5: Distributions of $p(\mathbf{x}_t|c^n)$ and $p(\mathbf{x}_t|\tilde{c})$. Visualization of the distributions of toy 2D data samples at timesteps 200, 400, 600, and 800, along with corresponding $\mathbf{x}_t$ images at each timestep. As the timestep increases, the distributions progressively overlap with each other.
  • ...and 8 more figures
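
The condition-swapping step described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `p_random` and `choose_condition` are hypothetical, and the linear form of $p(t)$ is an assumption made only for the example (the caption says only that $p(t)$ is predefined). The increasing shape is loosely motivated by Figure 5, where the conditional distributions overlap more at larger timesteps.

```python
import random


def p_random(t, num_timesteps=1000):
    """Hypothetical schedule p(t): chance of swapping in a random condition.

    Assumption for illustration only: p(t) grows linearly from 0 at t=0
    to 1 at t=num_timesteps. The paper's actual schedule may differ.
    """
    return t / num_timesteps


def choose_condition(t, paired_condition, text_pool, rng, num_timesteps=1000):
    """Return the condition fed to both teacher and student for a noised sample.

    With probability p(t), the original condition c^n is replaced by a
    random condition drawn from the text-only pool; otherwise c^n is kept.
    """
    if rng.random() < p_random(t, num_timesteps):
        return rng.choice(text_pool)
    return paired_condition


if __name__ == "__main__":
    rng = random.Random(0)
    pool = ["a photo of a dog", "a red car", "a bowl of fruit"]
    for t in (0, 500, 1000):
        print(t, choose_condition(t, "a photo of a cat", pool, rng))
```

In a distillation loop, the chosen condition would be passed to both the teacher and the student together with the same noised input, so the student's noise prediction and intermediate features can be matched to the teacher's even for prompts that have no paired image.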