Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals

Xiangyu Fan, Zesong Qiu, Zhuguanyu Wu, Fanzhou Wang, Zhiqian Lin, Tianxiang Ren, Dahua Lin, Ruihao Gong, Lei Yang

TL;DR

Phased DMD tackles the challenge of distilling score-based diffusion models into efficient few-step generators without sacrificing diversity or fidelity. It introduces two core ideas: progressive distribution matching across subintervals of the SNR and score matching within those subintervals, forming an emergent Mixture-of-Experts structure that expands capacity while maintaining a manageable training graph. The approach is theoretically grounded, deriving subinterval objectives that preserve unbiased flow matching and stability, and it is validated on state-of-the-art image and video models (e.g., Qwen-Image, Wan2.x), showing improved diversity retention and preservation of key capabilities relative to standard DMD and SGTS. The results suggest Phased DMD enables high-quality, diverse, few-step generation with data-free distillation, offering practical benefits for scalable diffusion-model deployment and MoE-inspired architectures in generative tasks.

Abstract

Distribution Matching Distillation (DMD) distills score-based generative models into efficient one-step generators, without requiring a one-to-one correspondence with the sampling trajectories of their teachers. However, limited model capacity causes one-step distilled models to underperform on complex generative tasks, e.g., synthesizing intricate object motions in text-to-video generation. Directly extending DMD to multi-step distillation increases memory usage and computational depth, leading to instability and reduced efficiency. While prior works propose stochastic gradient truncation as a potential solution, we observe that it substantially reduces the generation diversity of multi-step distilled models, bringing it down to the level of their one-step counterparts. To address these limitations, we propose Phased DMD, a multi-step distillation framework that bridges the idea of phase-wise distillation with Mixture-of-Experts (MoE), reducing learning difficulty while enhancing model capacity. Phased DMD is built upon two key ideas: progressive distribution matching and score matching within subintervals. First, our model divides the SNR range into subintervals, progressively refining the model to higher SNR levels, to better capture complex distributions. Second, we rigorously derive the training objective within each subinterval to ensure that it is accurate. We validate Phased DMD by distilling state-of-the-art image and video generation models, including Qwen-Image (20B parameters) and Wan2.2 (28B parameters). Experimental results demonstrate that Phased DMD preserves output diversity better than DMD while retaining key generative capabilities. We will release our code and models.
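
The subinterval idea in the abstract can be illustrated with a minimal sketch of how a sampling timestep might be routed to a per-phase expert. This is not the authors' implementation: the uniform split of the time range, the convention that t = 1 is pure noise (lowest SNR), and the names phase_boundaries and expert_index are all assumptions made for illustration.

```python
def phase_boundaries(num_phases):
    """Boundaries 1.0 = t_0 > t_1 > ... > t_K = 0.0.

    Assumption: the time range is split uniformly; the paper may
    choose its subinterval boundaries differently.
    """
    return [1.0 - k / num_phases for k in range(num_phases + 1)]


def expert_index(t, boundaries):
    """Return the phase k whose subinterval (t_{k+1}, t_k] contains t.

    Phase 0 covers the noisiest (lowest-SNR) timesteps; later phases
    cover progressively higher SNR levels.
    """
    for k in range(len(boundaries) - 1):
        if boundaries[k + 1] < t <= boundaries[k]:
            return k
    raise ValueError(f"timestep {t} is outside (0, 1]")


# Hypothetical 4-step sampler with 2 phases: each step is handled
# by the expert responsible for its subinterval.
bounds = phase_boundaries(2)
steps = [1.0, 0.75, 0.5, 0.25]
routing = [expert_index(t, bounds) for t in steps]
```

With two phases, the first two steps of this hypothetical schedule go to the low-SNR expert and the last two to the high-SNR expert, which is the kind of emergent Mixture-of-Experts structure the abstract describes.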

Paper Structure

This paper contains 24 sections, 16 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Schematic diagram of (a) Few-step DMD [yin2024DMD2], (b) Few-step DMD with stochastic gradient truncation strategy (SGTS) [huang2025_self_forcing], (c) Phased DMD and (d) Phased DMD with SGTS.
  • Figure 2: Sampling trajectories for 200 samples in a 1D toy experiment. (a) Training with the full-interval objective (Eq. \ref{eq:FlowMatchTarget}). (b) Training on $0.5 < t < 1$ with the correct subinterval objective (Eq. \ref{eq:FlowMatchTarget_conditional_singularity}). (c) Training on $0.5 < t < 1$ with an incorrect target: $\|\boldsymbol{\psi}_{\boldsymbol{\theta}}({\bm{x}}_t) - (\boldsymbol{\epsilon} - {\bm{x}}_s)\|^2$.
  • Figure 3: Samples (seeds 0-3) from the Wan2.1-T2V-14B base model (40 steps, CFG=4) and its distilled variants (4 steps, CFG=1): (a) Base, (b) DMD, (c) DMD with SGTS, (d) Phased DMD.
  • Figure 4: Examples generated by the Qwen-Image distilled with Phased DMD.
  • Figure 5: Samples generated with high-SNR experts from different training stages (top: 100 iterations; bottom: 400 iterations) and a shared low-SNR expert. Each column uses identical prompts and seeds.
  • ...and 4 more figures