Distribution Backtracking Builds A Faster Convergence Trajectory for Diffusion Distillation

Shengyuan Zhang, Ling Yang, Zejian Li, An Zhao, Chenye Meng, Changyuan Yang, Guang Yang, Zhiyuan Yang, Lingyun Sun

TL;DR

This work tackles the slow sampling speed of diffusion models by introducing Distribution Backtracking Distillation (DisBack), which uses the entire convergence trajectory between a teacher diffusion model and a student generator rather than only the teacher's endpoint. DisBack consists of two stages: Degradation Recording builds a degradation path from the teacher to the initial student, and Distribution Backtracking reverses this path to guide the student along the convergence trajectory toward the teacher, which speeds up distillation. Experiments on CIFAR10, FFHQ-64, ImageNet-64, and text-to-image tasks show that DisBack converges faster than baseline score distillation methods while maintaining or improving generation quality. The method is simple to implement, orthogonal to existing distillation strategies, and released with public code, so it can be combined with other approaches to one-step diffusion-based generation.

Abstract

Accelerating the sampling speed of diffusion models remains a significant challenge. Recent score distillation methods distill a heavy teacher model into a student generator to achieve one-step generation; the student is optimized by calculating the difference between the two score functions on samples generated by the student model. However, there is a score mismatch issue in the early stage of the distillation process, because existing methods mainly focus on using the endpoint of pre-trained diffusion models as teacher models, overlooking the importance of the convergence trajectory between the student generator and the teacher model. To address this issue, we extend the score distillation process by introducing the entire convergence trajectory of teacher models and propose Distribution Backtracking Distillation (DisBack). DisBack is composed of two stages: Degradation Recording and Distribution Backtracking. Degradation Recording is designed to obtain the convergence trajectory of the teacher model: it records the degradation path from the trained teacher model to the untrained initial student generator. The degradation path implicitly represents the teacher model's intermediate distributions, and its reverse can be viewed as the convergence trajectory from the student generator to the teacher model. Distribution Backtracking then trains a student generator to backtrack the intermediate distributions along the path, approximating the convergence trajectory of teacher models. Extensive experiments show that DisBack achieves faster and better convergence than existing distillation methods and accomplishes comparable generation performance, with an FID score of 1.38 on the ImageNet 64x64 dataset. Notably, DisBack is easy to implement and can be generalized to existing distillation methods to boost performance. Our code is publicly available at https://github.com/SYZhang0805/DisBack.
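
For orientation, the score distillation update that the abstract refers to is often written in the following form (as in Diff-Instruct). The notation here is an assumption for illustration and is not taken from this page: $G_\eta$ is the student generator with parameters $\eta$, $s_\phi$ is a score network fitted to the student's noised samples, $s_\theta$ is the teacher's score, $w(t)$ is a weighting function, and $x_t$ is $G_\eta(z)$ diffused to noise level $t$.

$$
\nabla_\eta \mathcal{L}(\eta) = \mathbb{E}_{t,\, z,\, x_t}\left[ w(t)\,\big( s_\phi(x_t, t) - s_\theta(x_t, t) \big)\, \frac{\partial x_t}{\partial \eta} \right]
$$

Under this reading, DisBack would replace the fixed target $s_\theta$ with the checkpoints recorded along the degradation path, visited in reverse order, so that the target score stays close to the student's current distribution in the early stage of training.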

Paper Structure

This paper contains 46 sections, 1 theorem, 22 equations, 19 figures, 6 tables, 2 algorithms.

Key Result

Theorem 1

Given $t > 0$, we have,

Figures (19)

  • Figure 1: Comparison of the distillation process between the existing SOTA score distillation method Diff-Instruct and the proposed DisBack on (a) CIFAR10, (b) FFHQ 64x64, and (c) ImageNet 64x64 datasets. The first 200 epochs refer to the computational overhead of the degradation recording stage of DisBack. DisBack achieves a faster convergence speed due to the constraint of the entire convergence trajectory between the student generator and the teacher model.
  • Figure 2: Several examples of 1024$\times$1024 images generated by our proposed one-step DisBack model distilled from SDXL.
  • Figure 3: The overall framework of DisBack. Stage 1: An auxiliary diffusion model is initialized with the teacher model $s_\theta$ and then fits the distribution of the initial student generator $G_\mathit{stu}^0$. The intermediate checkpoints $\{ s_{\theta_i}^\prime \mid i = 0, \ldots, N \}$ are saved to form a degradation path. The degradation path is then reversed and viewed as the convergence trajectory. Stage 2: Each intermediate node $s_{\theta_i}^\prime$ along the convergence trajectory is distilled into the student generator sequentially until the generator converges to the distribution of the teacher model.
  • Figure 4: The mismatch degree during the distillation process of Diff-Instruct and the proposed DisBack. The standard deviation is visualized. DisBack effectively mitigates the mismatch degree during the entire distillation process.
  • Figure 5: Samples generated by DisBack distilled from SDXL at 1024$\times$1024 resolution.
  • ...and 14 more figures
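
The two stages described in the Figure 3 caption can be illustrated with a toy sketch. This is not the authors' implementation: it assumes one-dimensional, unit-variance Gaussians so that every "score network" reduces to a single mean parameter, it ignores diffusion noise levels entirely, and all function and parameter names (`degradation_recording`, `distribution_backtracking`, `student_shift`, and so on) are hypothetical.

```python
import numpy as np


def score(x, mean):
    """Score d/dx log p(x) of a unit-variance Gaussian N(mean, 1)."""
    return -(x - mean)


def degradation_recording(teacher_mean, student_shift, n_checkpoints=5,
                          steps_per_checkpoint=20, lr=0.05, seed=0):
    """Stage 1 (toy): start an auxiliary model at the teacher and fit it to
    the initial student's samples, saving checkpoints along the way."""
    rng = np.random.default_rng(seed)
    aux_mean = float(teacher_mean)      # auxiliary model initialized as teacher
    path = [aux_mean]                   # checkpoint 0 is the teacher itself
    for _ in range(n_checkpoints):
        for _ in range(steps_per_checkpoint):
            # Samples from the initial student G(z) = z + student_shift.
            x = rng.standard_normal(256) + student_shift
            # Gradient of 0.5 * E[(x - aux_mean)^2] with respect to aux_mean.
            aux_mean -= lr * np.mean(aux_mean - x)
        path.append(aux_mean)
    return path                         # teacher -> (approximately) student


def distribution_backtracking(path, student_shift, steps_per_node=50,
                              lr=0.1, seed=1):
    """Stage 2 (toy): distill each checkpoint into the student, walking the
    degradation path in reverse so the final target is the teacher."""
    rng = np.random.default_rng(seed)
    shift = float(student_shift)
    for target_mean in reversed(path):
        for _ in range(steps_per_node):
            x = rng.standard_normal(256) + shift
            # Score difference between the student's own distribution (known
            # analytically in this toy) and the current target checkpoint.
            # dx/dshift = 1, so this is the gradient with respect to shift.
            grad = np.mean(score(x, shift) - score(x, target_mean))
            shift -= lr * grad
    return shift


path = degradation_recording(teacher_mean=4.0, student_shift=0.0)
final_shift = distribution_backtracking(path, student_shift=0.0)
print("degradation path:", np.round(path, 3))
print("student mean after backtracking:", round(final_shift, 3))
```

In this toy the student's own score is available in closed form; in a real score distillation setup it would have to be estimated by a separately trained network. The point of the sketch is only the control flow: record checkpoints while degrading from teacher to student, then use them as successive targets in reverse order.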

Theorems & Definitions (1)

  • Theorem 1: The global optimum of training prolificdreamer