Inference-Time Diffusion Model Distillation

Geon Yeong Park, Sang Wan Lee, Jong Chul Ye

TL;DR

Diffusion models sample slowly, and fast distilled student models still lag behind their high-quality teacher counterparts. The authors propose Distillation++, an inference-time, tuning-free distillation framework that uses a score distillation sampling (SDS) loss and teacher guidance during sampling to steer the student trajectory toward the teacher's clean manifold, without requiring extra data. The method extends to text-conditioned sampling and multiple solvers via a simple interpolation-based update; guidance is applied only in the early sampling steps, which yields gains in fidelity and semantic alignment at modest computational cost. Experiments on SDXL-based baselines show consistent improvements over state-of-the-art distillation methods, supporting the practicality of tuning-free, inference-time distillation for diffusion models.

Abstract

Diffusion distillation models effectively accelerate reverse sampling by compressing the process into fewer steps. However, these models still exhibit a performance gap compared to their pre-trained diffusion model counterparts, exacerbated by distribution shifts and accumulated errors during multi-step sampling. To address this, we introduce Distillation++, a novel inference-time distillation framework that reduces this gap by incorporating teacher-guided refinement during sampling. Inspired by recent advances in conditional sampling, our approach recasts student model sampling as a proximal optimization problem with a score distillation sampling (SDS) loss. To this end, we integrate distillation optimization into reverse sampling, which can be viewed as teacher guidance that uses pre-trained diffusion models to drive the student sampling trajectory towards the clean manifold. Thus, Distillation++ improves the denoising process in real time without additional source data or fine-tuning. Distillation++ demonstrates substantial improvements over state-of-the-art distillation baselines, particularly in early sampling stages, positioning itself as a robust guided sampling process crafted for diffusion distillation models. Code: https://github.com/geonyeong-park/inference_distillation.
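
The teacher-guided refinement described above can be illustrated with a small toy sketch. It is not the authors' implementation (see the linked repository for that); it only shows the general shape of an interpolation-based update in which the student's clean estimate is renoised, denoised by a teacher, and blended with the teacher's estimate. The function names, the variance-preserving renoising form, the blend weight `lam`, and the constant toy teacher are all assumptions made for illustration.

```python
import math
import random


def renoise(x0, alpha_bar, rng):
    """Forward-diffuse a clean estimate to noise level alpha_bar.

    Assumes the variance-preserving form
    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps.
    """
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


def teacher_guided_estimate(x0_student, teacher_denoise, alpha_bar, lam, rng):
    """Blend the student's clean estimate with a teacher estimate.

    teacher_denoise(x_noisy, alpha_bar) is a hypothetical callable that
    returns the teacher's clean estimate. lam in [0, 1] is an assumed
    guidance weight: 0 keeps the student estimate, 1 uses the teacher's.
    """
    x_noisy = renoise(x0_student, alpha_bar, rng)     # renoise the student estimate
    x0_teacher = teacher_denoise(x_noisy, alpha_bar)  # teacher denoises it
    return [(1.0 - lam) * s + lam * t for s, t in zip(x0_student, x0_teacher)]


# Toy teacher: data concentrated at a single point, so the ideal denoiser
# returns that point regardless of its noisy input.
target = [1.0, 1.0, 1.0]


def toy_teacher(x_noisy, alpha_bar):
    return list(target)


rng = random.Random(0)
refined = teacher_guided_estimate([0.0, 0.0, 0.0], toy_teacher,
                                  alpha_bar=0.5, lam=0.25, rng=rng)
print(refined)  # [0.25, 0.25, 0.25]
```

In a real sampler the refined estimate would replace the student's clean estimate only in the first step or first few steps, with the remaining steps run by the student alone. That matches the summary's point that guidance is applied early to keep the extra cost modest.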

Paper Structure

This paper contains 15 sections, 21 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: Experiments comparing baselines (LCM-LoRA [luo2023lcmlora], SDXL-Lightning [lin2024sdxl], DMD2 [yin2024improved]), the teacher model (SDXL [podell2023sdxl]), and the proposed framework. Results are generated with 4-step sampling. We improve the visual fidelity and textual alignment of student baselines by conducting an inference-time diffusion distillation under the guidance of a teacher model, e.g. SDXL, in early sampling stages (the first step only).
  • Figure 2: Overview. (a) Diffusion models (in blue) sample by solving the PF-ODE, which requires a computationally expensive integral from time $T$ to 0. Student models (in black) accelerate sampling by approximating this integral, but their (initial) estimates are often suboptimal. (b) To bridge this gap post-training, we propose an inference-time distillation. Specifically, we refine the student models' initial estimates by moving them towards teacher estimates, obtained by consecutive renoising and denoising, as in (\ref{eq: teacher guidance2}). (c) This process functions as a form of teacher guidance in (\ref{eq: teacher guidance3}), steering the sampling trajectory closer to the teacher model's distribution, thereby (d) improving the sampling path.
  • Figure 3: Qualitative comparisons against state-of-the-art distillation baselines. Baselines using 4 sampling steps: SDXL-Lightning, DMD2, SDXL-Turbo. Baselines using 8 sampling steps: LCM, LCM-LoRA. By conducting the inference-time distillation in early sampling stages, we reduce artifacts and improve visual fidelity and textual alignment.
  • Figure 4: (a) Results of the baseline (LCM-LoRA) with varying numbers of sampling steps (4, 6, 7, 8). Increasing the number of sampling steps of student models does not guarantee improvements in textual alignment or physical feasibility. (b) Our improved results with inference-time distillation. Teacher guidance is applied only at the first of 8 steps (total steps = 8 + 1).
  • Figure 5: (a) Results of the baseline (LCM) with 4 and 8 sampling steps. (b) Ours with 4-step sampling + 1 distillation step.
  • ...and 4 more figures