Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield

Dongyang Liu, Peng Gao, David Liu, Ruoyi Du, Zhen Li, Qilong Wu, Xin Jin, Sihan Cao, Shifeng Zhang, Hongsheng Li, Steven Hoi

TL;DR

The paper reframes diffusion-model distillation by decoupling CFG Augmentation (CA) from Distribution Matching (DM): CA acts as the engine that enables few-step generation, while DM serves as a stabilizing regularizer. Through a gradient decomposition and ablation studies, it shows CA largely drives the multi-to-few-step conversion, whereas DM prevents training collapse and artifacts. It further demonstrates that decoupled re-noising schedules for CA and DM yield tangible performance gains, validated on large-scale text-to-image pipelines and real-world applications like Z-Image. This principled perspective offers a more robust framework for designing efficient diffusion-based generators with improved stability and quality.

Abstract

Diffusion model distillation has emerged as a powerful technique for creating efficient few-step and single-step generators. Among these, Distribution Matching Distillation (DMD) and its variants stand out for their impressive performance, which is widely attributed to their core mechanism of matching the student's output distribution to that of a pre-trained teacher model. In this work, we challenge this conventional understanding. Through a rigorous decomposition of the DMD training objective, we reveal that in complex tasks like text-to-image generation, where CFG is typically required for desirable few-step performance, the primary driver of few-step distillation is not distribution matching, but a previously overlooked component we identify as CFG Augmentation (CA). We demonstrate that this term acts as the core "engine" of distillation, while the Distribution Matching (DM) term functions as a "regularizer" that ensures training stability and mitigates artifacts. We further validate this decoupling by demonstrating that while the DM term is a highly effective regularizer, it is not unique; simpler non-parametric constraints or GAN-based objectives can serve the same stabilizing function, albeit with different trade-offs. This decoupling of labor motivates a more principled analysis of the properties of both terms, leading to a more systematic and in-depth understanding. This new understanding further enables us to propose principled modifications to the distillation process, such as decoupling the noise schedules for the engine and the regularizer, leading to further performance gains. Notably, our method has been adopted by the Z-Image (https://github.com/Tongyi-MAI/Z-Image) project to develop a top-tier 8-step image generation model, empirically validating the generalization and robustness of our findings.
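
The decomposition described in the abstract can be sketched in one line of algebra. The notation below is an assumption made for illustration and is not taken from the text above: $s_{\text{real}}$ denotes the teacher's score, $s_{\text{fake}}$ the score of the student's output distribution, $c$ the text condition, $\varnothing$ the null condition, $x_\tau$ the re-noised sample, and $\alpha$ the CFG scale. Assuming the teacher score used in training is the CFG-guided one, the score difference that drives the generator update splits into a DM term and a CA term:

```latex
\begin{align}
s_{\text{real}}^{\text{cfg}}(x_\tau, c)
  &= s_{\text{real}}(x_\tau, \varnothing)
   + \alpha \bigl[ s_{\text{real}}(x_\tau, c) - s_{\text{real}}(x_\tau, \varnothing) \bigr] \\
s_{\text{real}}^{\text{cfg}}(x_\tau, c) - s_{\text{fake}}(x_\tau, c)
  &= \underbrace{s_{\text{real}}(x_\tau, c) - s_{\text{fake}}(x_\tau, c)}_{\text{Distribution Matching (DM)}}
   + \underbrace{(\alpha - 1) \bigl[ s_{\text{real}}(x_\tau, c) - s_{\text{real}}(x_\tau, \varnothing) \bigr]}_{\text{CFG Augmentation (CA)}}
\end{align}
```

Under these assumptions, setting $\alpha = 1$ makes the CA term vanish and leaves only the DM term, which is why the two contributions can be studied and scheduled separately.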

Paper Structure

This paper contains 23 sections, 8 equations, 6 figures, 7 tables, 1 algorithm.

Figures (6)

  • Figure 1: Two perspectives on the DMD algorithm. (a) The conventional view, which treats the use of CFG as a heuristic relaxation of the theoretical framework, with the algorithm's success solely attributed to this (relaxed) distribution matching mechanism. (b) Our proposed decoupled view, where the objective is a combination of two distinct mechanisms: a CFG Augmentation (CA) engine that drives the few-step conversion, and a Distribution Matching (DM) regularizer that strictly adheres to the theoretical derivation (Eq. \ref{eq:ikl_definition}) and ensures training stability.
  • Figure 2: Ablation study on the roles of CFG Augmentation (CA) and Distribution Matching (DM). Metrics are evaluated on 1k prompts sampled from COCO-10k (Lin et al., 2014).
  • Figure 3: CFG Augmentation with different regularizers. Image Reward and HPS v2.1 are evaluated on 1k prompts sampled from COCO-10k. Setting: 4-step SDXL. See Fig. \ref{fig:reg_visual} for visualized samples.
  • Figure 4: (a) Visualization of the effect of the re-noising timestep $\tau$ in CFG Augmentation (CA). The generator is trained with CA alone. In our notation, $\tau=0$ corresponds to pure noise and $\tau=1$ to clean data. (b) Illustration of the DM corrective mechanism. The generator is trained with CA alone, while the fake model keeps training on the generator's output as in DMD.
  • Figure 5: Un-cherry-picked qualitative comparison of different re-noising schedule configurations. Top row: ➁ Decoupled-Full, $\tau_{\text{CA}}, \tau_{\text{DM}} \in [0, 1]$. Middle row: ➂ Coupled-Constrained, $\tau_{\text{CA}}, \tau_{\text{DM}} > t$. Bottom row: ➃ our proposed Decoupled-Hybrid, $\tau_{\text{CA}} > t, \tau_{\text{DM}} \in [0, 1]$.
  • ...and 1 more figures
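
The three re-noising configurations named in the Figure 5 caption can be sketched as a small sampler. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_renoise_timesteps`, the mode strings, the uniform sampling distribution, and the reading of "coupled" as one shared draw for both terms are all hypothetical. Only the ranges ($\tau \in [0, 1]$ or $\tau > t$) and the convention that $\tau=0$ is pure noise and $\tau=1$ is clean data come from the captions.

```python
import random


def _uniform_above(lo, rng):
    """Draw tau from (lo, 1], redrawing in the measure-zero case tau == lo."""
    tau = rng.uniform(lo, 1.0)
    while tau <= lo:
        tau = rng.uniform(lo, 1.0)
    return tau


def sample_renoise_timesteps(t, mode="decoupled_hybrid", rng=None):
    """Return (tau_ca, tau_dm) for a generator timestep t in [0, 1).

    Convention from the captions: tau = 0 is pure noise, tau = 1 is clean data.
    The uniform distribution is an assumption made for this sketch.
    """
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    rng = rng or random.Random()

    if mode == "decoupled_full":
        # Both terms draw independently from the full range [0, 1].
        return rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)
    if mode == "coupled_constrained":
        # Assumption: "coupled" means a single shared draw with tau > t.
        tau = _uniform_above(t, rng)
        return tau, tau
    if mode == "decoupled_hybrid":
        # CA is restricted to tau > t; DM still covers the full range [0, 1].
        return _uniform_above(t, rng), rng.uniform(0.0, 1.0)
    raise ValueError(f"unknown mode: {mode}")
```

A call such as `sample_renoise_timesteps(0.5, "decoupled_hybrid", random.Random(0))` returns one pair of timesteps; in a training loop the two values would be used to re-noise the generator output separately for the CA and DM terms.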