1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation

Haoyu Li, Tingyan Wen, Lin Qi, Zhe Wu, Yihuang Chen, Xing Zhou, Lifei Zhu, Xueqian Wang, Kai Zhang

Abstract

Diffusion models produce high-quality text-to-image results, but their iterative denoising is computationally expensive. Distribution Matching Distillation (DMD) has emerged as a promising path to few-step distillation, but it suffers from diversity collapse and fidelity degradation when reduced to two steps or fewer. We present 1.x-Distill, the first fractional-step distillation framework that breaks the integer-step constraint of prior few-step methods and establishes 1.x-step generation as a practical regime for distilled diffusion models. Specifically, we first analyze the overlooked role of teacher CFG in DMD and introduce a simple yet effective modification to suppress mode collapse. Then, to improve performance under extreme step reduction, we introduce Stagewise Focused Distillation, a two-stage strategy that learns coarse structure through diversity-preserving distribution matching and refines details with inference-consistent adversarial distillation. Furthermore, we design a lightweight compensation module for Distill–Cache co-Training, which naturally incorporates block-level caching into our distillation pipeline. Experiments on SD3-Medium and SD3.5-Large show that 1.x-Distill surpasses prior few-step methods, achieving better quality and diversity at 1.67 and 1.74 effective NFEs, respectively, with up to 33x speedup over the original 28x2-NFE sampling.

Paper Structure

This paper contains 46 sections, 11 equations, 17 figures, 7 tables, and 2 algorithms.

Figures (17)

  • Figure 1: Visual results. 1.x-Distill mitigates the mode collapse and quality degradation of vanilla DMD under extreme step reduction, delivering superior few-step results.
  • Figure 2: An illustration of the effect of the teacher’s CFG in distillation. At the high-noise timestep $t$, teacher estimation with strong guidance ${\color{red}{v^\text{cfg}_\text{real}}}={\color{blue}{v_{\text{real},\text{c}}}}+(w-1)(v_{\text{real},\text{c}}-v_{\text{real},\emptyset})$ tends to drive the student to collapse prematurely toward dominant modes. We propose to disable teacher CFG at $t\in(0,\alpha]$ during distribution matching, encouraging the student to cover more modes along the early denoising trajectory (a minimal sketch of this guidance rule follows the figure list).
  • Figure 3: Overview of 1.x-Distill. Our guidance control (\ref{sec:cfg}) and cache design (\ref{sec:cache}) are both constructed within the two-stage framework (\ref{sec:SFD}). Stage I: Train the generator with the DMD loss. Within the DMD framework, we apply importance sampling over the diffusion timestep $t$ and control the guidance according to the sampled $t$ when computing the real score. Stage II: Train the generator with a pixel-space adversarial loss. Our GAN framework produces $\hat{x}_0$ along the generator’s inference path, which naturally incorporates the block-cache design. The generator and MLP module are jointly optimized.
  • Figure 4: Importance sampling in Stage I. Left: Under the teacher scheduler (shift=3.0), we split timesteps from 1.0 to 0.0 into four windows to probe their effects. Right: Uniform sampling treats all timesteps equally, while our importance sampling down-weights less informative ones and concentrates training on the more reliable region (a sampling sketch follows the figure list).
  • Figure 5: Caching for a distilled 2-step student. (a) We measure block-wise reuse error as the contribution change across adjacent steps on SD3-M, $e_n=\lVert \Delta_{n,t+1}-\Delta_{n,t}\rVert_1$, where $\Delta_{n,t}=O_{n,t}-I_{n,t}$. Early blocks exhibit consistently small $e_n$, indicating strong temporal redundancy and low reuse error. (b) Leveraging this property, we cache the contribution of a block segment $[n,m]$ at step $t_0$, $\Delta_0=O_{m,0}-I_{n,0}$, skip the segment at $t_1$, and recover the output via $\hat{O}_{m,1}=I_{n,1}+f({\color{red}\Delta_0})$ (see the caching sketch after this figure list).
  • ...and 12 more figures
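To make the guidance control of Figure 2 concrete, here is a minimal sketch, not the paper's implementation: the `teacher` callable, the velocity convention, and the threshold `alpha` are assumed interfaces, and the rule simply disables teacher CFG for $t\in(0,\alpha]$ when estimating the real score.

```python
def real_score_velocity(teacher, x_t, t, cond, uncond, w, alpha):
    """Teacher velocity used as the 'real score' in distribution matching.

    Hypothetical helper: `teacher(x_t, t, text)` returns a velocity
    prediction. Follows the caption of Figure 2,
    v_cfg = v_c + (w - 1) * (v_c - v_null),
    with teacher CFG disabled for t in (0, alpha].
    """
    v_c = teacher(x_t, t, cond)                  # conditional prediction
    if t <= alpha:                               # guidance-free region
        return v_c
    v_null = teacher(x_t, t, uncond)             # unconditional prediction
    return v_c + (w - 1.0) * (v_c - v_null)      # standard CFG extrapolation
```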
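The importance sampling of Figure 4 can likewise be sketched as window-level weighted sampling of $t$. The window boundaries and weights below are placeholders; the paper's actual values are not given in this section.

```python
import torch

# Placeholder windows over t in [0, 1] (Figure 4 splits [1.0, 0.0] into four)
WINDOWS = torch.tensor([[0.75, 1.00],
                        [0.50, 0.75],
                        [0.25, 0.50],
                        [0.00, 0.25]])
WEIGHTS = torch.tensor([0.4, 0.3, 0.2, 0.1])  # illustrative importance weights

def sample_timesteps(batch_size: int) -> torch.Tensor:
    """Draw diffusion timesteps with window-level importance sampling."""
    idx = torch.multinomial(WEIGHTS, batch_size, replacement=True)
    lo, hi = WINDOWS[idx, 0], WINDOWS[idx, 1]
    return lo + (hi - lo) * torch.rand(batch_size)  # uniform within each window
```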
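Finally, a sketch of the block-level caching and compensation rule of Figure 5, assuming a hypothetical `CachedSegment` wrapper around the skipped DiT blocks; the compensation MLP $f$ and its width are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CachedSegment(nn.Module):
    """Hypothetical wrapper for a block segment [n, m] in a 2-step student.

    Step 0: run the blocks and cache their contribution Delta_0 = O_m - I_n.
    Step 1: skip the blocks and recover O_m ~ I_n + f(Delta_0), where f is
    the lightweight compensation MLP trained jointly with the generator.
    """

    def __init__(self, blocks: nn.ModuleList, hidden_dim: int):
        super().__init__()
        self.blocks = blocks
        self.comp = nn.Sequential(               # compensation MLP f (illustrative)
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.cached_delta = None

    def forward(self, h: torch.Tensor, step: int) -> torch.Tensor:
        if step == 0:                            # first step: compute and cache
            out = h
            for blk in self.blocks:
                out = blk(out)
            self.cached_delta = out - h          # Delta_0 = O_m - I_n
            return out
        # second step: skip the segment, reuse the cached contribution
        return h + self.comp(self.cached_delta)
```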