Distilling Diversity and Control in Diffusion Models

Rohit Gandikota, David Bau

TL;DR

We introduce diversity distillation, a hybrid approach that uses the base model for only the first critical timestep before switching to the distilled model. We also provide causal validation and theoretical support showing why the very first timestep concentrates the diversity bottleneck in distilled models.

Abstract

Distilled diffusion models generate images in far fewer timesteps but suffer from reduced sample diversity when generating multiple outputs from the same prompt. To understand this phenomenon, we first investigate whether distillation damages concept representations by examining if the required diversity is properly learned. Surprisingly, distilled models retain the base model's representational structure: control mechanisms like Concept Sliders and LoRAs transfer seamlessly without retraining, and SliderSpace analysis reveals distilled models possess variational directions needed for diversity yet fail to activate them. This redirects our investigation to understanding how the generation dynamics differ between base and distilled models. Using $\hat{\mathbf{x}}_{0}$ trajectory visualization, we discover distilled models commit to their final image structure almost immediately at the first timestep, while base models distribute structural decisions across many steps. To test whether this first-step commitment causes the diversity loss, we introduce diversity distillation, a hybrid approach using the base model for only the first critical timestep before switching to the distilled model. This single intervention restores sample diversity while maintaining computational efficiency. We provide both causal validation and theoretical support showing why the very first timestep concentrates the diversity bottleneck in distilled models. Our code and data are available at https://distillation.baulab.info/
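The hybrid sampling idea described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes both models are noise predictors with the hypothetical signature `model(x_t, alpha_bar_t)`, that they share one noise schedule, and that a deterministic DDIM-style update is used. The names `predict_x0`, `ddim_step`, and `hybrid_sample` are invented for this example.

```python
import numpy as np


def predict_x0(x_t, eps, alpha_bar_t):
    """One-step estimate of the clean sample (the x0-hat used for trajectory visualization)."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)


def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM-style update from noise level alpha_bar_t to alpha_bar_prev."""
    x0_hat = predict_x0(x_t, eps, alpha_bar_t)
    x_prev = np.sqrt(alpha_bar_prev) * x0_hat + np.sqrt(1.0 - alpha_bar_prev) * eps
    return x_prev, x0_hat


def hybrid_sample(base_eps, distilled_eps, x_T, alpha_bars, base_steps=1):
    """Run the first `base_steps` updates with the base model, the rest with the distilled model.

    alpha_bars: increasing cumulative-alpha values, ordered from noisiest to clean.
    The last entry is only ever used as a target level (1.0 means fully denoised).
    Returns the final sample and the list of x0-hat predictions, one per step.
    """
    x = x_T
    x0_trajectory = []
    for i in range(len(alpha_bars) - 1):
        # Assumption: only the model choice changes between steps, not the schedule.
        model = base_eps if i < base_steps else distilled_eps
        eps = model(x, alpha_bars[i])
        x, x0_hat = ddim_step(x, eps, alpha_bars[i], alpha_bars[i + 1])
        x0_trajectory.append(x0_hat)
    return x, x0_trajectory
```

With `base_steps=1`, only the first update pays the cost of the base model. How the two models' schedules are aligned in practice is not specified in this summary, so the shared schedule here is purely an assumption of the sketch.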

Paper Structure

This paper contains 29 sections, 3 theorems, 13 equations, 17 figures, 5 tables, 1 algorithm.

Key Result

Lemma 1

Let $\Delta \varepsilon$ denote a perturbation in the noise prediction at timestep $t$. Then the induced change in $\hat{x}_{0|t}$ is
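For orientation, a worked form under the standard DDPM noise-prediction parameterization is given here. This parameterization is an assumption; the paper's exact statement and notation may differ. With $\bar{\alpha}_t$ denoting the cumulative noise-schedule product, the one-step estimate and its response to a perturbation $\Delta \varepsilon$ are

$$
\hat{x}_{0|t} = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\varepsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}
\quad\Longrightarrow\quad
\Delta \hat{x}_{0|t} = -\frac{\sqrt{1-\bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}}\,\Delta \varepsilon .
$$

Under this assumption the scaling factor is largest when $\bar{\alpha}_t$ is smallest, that is, at the earliest and noisiest timestep.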

Figures (17)

  • Figure 1: Diversity Distillation: (a) SDXL-Base is very slow (9.22s) but has good sample diversity for the prompt "Cartoon character", sampling a wide range of styles, creatures, backgrounds, and poses. (b) SDXL-DMD2 is fast (0.64s) but sacrifices diversity. With the same prompt, samples all have the same style, pose, species, and context. (c) We show how the diversity of the base model can be distilled into the fast model by substituting the first timestep, achieving both speed and diversity (0.64s). Control Distillation: (d) Despite the lack of sample diversity in distilled models, control mechanisms like Concept Sliders trained on base models transfer perfectly to distilled variants, demonstrating that the representational structure for diversity exists but is not spontaneously activated during generation.
  • Figure 2: Control directions (Concept Sliders gandikota2024concept), customization adapters (Custom Diffusion kumari2023multi), and variational directions (SliderSpace gandikota2025sliderspace) trained on SDXL-Base transfer to all distilled models without additional finetuning. SliderSpace results suggest that the top variation directions that capture sample diversity in the base model exist in distilled models but are not spontaneously activated during generation.
  • Figure 3: $\hat{\mathbf{x}}_{0}$ visualization reveals generation inconsistencies. When prompted with "Image of dog and cat sitting on sofa," the SDXL model produces an image with only a dog. However, $\hat{\mathbf{x}}_{0}$ visualization at $T=10$ shows the model initially conceptualizing a cat face (red box) before abandoning this element in the final generation. This demonstrates how diffusion models can discard semantic elements during the denoising process.
  • Figure 4: Comparison of standard diffusion visualization vs. $\hat{\mathbf{x}}_{0}$ visualization. (a) Standard visualization of intermediate latents shows subtle differences between base and distilled models. (b) $\hat{\mathbf{x}}_{0}$ visualization reveals dramatic differences in how models predict the final output. Distilled models commit to final image structure in the first timestep, while base models gradually refine structure across multiple steps, explaining the observed mode collapse in distilled models.
  • Figure 5: Measuring the DreamSim distance between the intermediate $\hat{\mathbf{x}}_{0}$ predictions and the final generated image reveals that distilled models establish structural image composition within the initial diffusion step, whereas base models require approximately 30% of their steps to achieve comparable structural definition.
  • ...and 12 more figures

Theorems & Definitions (6)

  • Lemma 1: Sensitivity
  • proof
  • Proposition 1: Amplified Diversity Loss
  • proof
  • Theorem 1: First-Timestep Dominance
  • proof