From Structure to Detail: Hierarchical Distillation for Efficient Diffusion Model

Hanbo Cheng, Peng Wang, Kaixiang Lei, Qi Li, Zhen Zou, Pengfei Hu, Jun Du

TL;DR

This work analyzes the fundamental bottleneck of trajectory-based distillation (TD) in diffusion models as lossy compression of high-frequency details and combines it with distribution matching distillation (DMD) in a two-stage Hierarchical Distillation (HD) pipeline. Stage 1 uses MeanFlow-based TD to inject a strong structural prior, creating a well-posed initialization, while Stage 2 applies DMD with an Adaptive Weighted Discriminator (AWD) to refine details and preserve diversity. Theoretical unification shows TD objectives converge to mean-velocity estimation, explaining fidelity limits, which HD overcomes by splitting structure and detail refinement. Empirically, HD achieves state-of-the-art single-step diffusion results on ImageNet $256\times256$ (FID $\approx 2.26$, rivaling a 250-step teacher) and strong performance on MJHQ, while reducing FLOPs by about $69.7\times$, establishing a practical path to real-time high-fidelity diffusion generation.

Abstract

The inference latency of diffusion models remains a critical barrier to their real-time application. While trajectory-based and distribution-based step distillation methods offer solutions, they present a fundamental trade-off. Trajectory-based methods preserve global structure but act as a "lossy compressor", sacrificing high-frequency details. Conversely, distribution-based methods can achieve higher fidelity but often suffer from mode collapse and unstable training. This paper recasts them from independent paradigms into synergistic components within our novel Hierarchical Distillation (HD) framework. We leverage trajectory distillation not as a final generator, but to establish a structural "sketch", providing a near-optimal initialization for the subsequent distribution-based refinement stage. This strategy yields an ideal initial distribution that enhances the ceiling of overall performance. To further improve quality, we introduce and refine the adversarial training process. We find standard discriminator structures are ineffective at refining an already high-quality generator. To overcome this, we introduce the Adaptive Weighted Discriminator (AWD), tailored for the HD pipeline. By dynamically allocating token weights, AWD focuses on local imperfections, enabling efficient detail refinement. Our approach demonstrates state-of-the-art performance across diverse tasks. On ImageNet $256\times256$, our single-step model achieves an FID of 2.26, rivaling its 250-step teacher. It also achieves promising results on the high-resolution text-to-image MJHQ benchmark, proving its generalizability. Our method establishes a robust new paradigm for high-fidelity, single-step diffusion models.
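The idea of "dynamically allocating token weights" can be illustrated with a small sketch. The abstract does not specify how AWD computes or applies its weights, so everything below is an assumption made for illustration: the function names, the use of a softmax over per-token scores, and the weighted sum of per-token logits are hypothetical and are not the authors' implementation. The sketch only shows why a weighted aggregate can react more strongly to one flawed local region than a uniform average does.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_logit(token_logits):
    """Baseline: every token contributes equally to the image-level logit."""
    return sum(token_logits) / len(token_logits)


def weighted_logit(token_logits, token_scores):
    """Hypothetical adaptive aggregation (an assumption, not the paper's AWD).

    token_logits: per-token real/fake logits (positive means "looks real").
    token_scores: per-token importance scores; in a real model these would
                  be predicted by the network, here they are plain inputs.
    """
    if len(token_logits) != len(token_scores):
        raise ValueError("token_logits and token_scores must have equal length")
    weights = softmax(token_scores)
    return sum(w * l for w, l in zip(weights, token_logits))


# Three tokens look real, one token contains a local artifact.
logits = [2.0, 2.0, 2.0, -4.0]
# Assumed importance scores that single out the flawed token.
scores = [0.0, 0.0, 0.0, 3.0]

uniform = mean_logit(logits)              # the artifact is averaged away
adaptive = weighted_logit(logits, scores)  # the artifact dominates
```

With uniform averaging the single flawed token is diluted by the three good ones, whereas the weighted aggregate is pulled toward the flawed token's logit. When all scores are equal, the two aggregates coincide.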

Paper Structure

This paper contains 16 sections, 4 theorems, 33 equations, 8 figures, 4 tables.

Key Result

Proposition 1

Continuous Consistency Models implicitly model the mean velocity over the interval $[0, t]$.
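
As a reading aid, the following sketch shows what "mean velocity over $[0, t]$" means and why a consistency function can be viewed this way. The notation is an assumption, since the paper's own equations are not reproduced on this page: $v$ denotes the instantaneous velocity of the probability-flow ODE $\frac{dz_\tau}{d\tau} = v(z_\tau, \tau)$, $u$ denotes the mean velocity in the MeanFlow sense, and time is oriented so that $z_0$ is the trajectory endpoint that the consistency function $f$ maps to.

$$u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau$$

Setting $r = 0$ and integrating the ODE along the trajectory gives

$$u(z_t, 0, t) = \frac{1}{t} \int_0^t v(z_\tau, \tau)\, d\tau = \frac{z_t - z_0}{t}, \qquad \text{so} \qquad f(z_t, t) = z_0 = z_t - t\, u(z_t, 0, t).$$

Under these assumptions, learning a map from $z_t$ to $z_0$ is equivalent to learning the mean velocity over $[0, t]$, which is the sense in which the proposition can be read.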

Figures (8)

  • Figure 1: Comparison of generation quality between the 50-step teacher, SANA, and our 1-step HD method. Our approach achieves comparable quality to the multi-step teacher.
  • Figure 2: The Hierarchical Distillation (HD) Pipeline. Our method consists of two main stages: (1) Structured Initialization: A MeanFlow-based approach imbues the student with foundational structural information. (2) Distribution Refinement: A second stage restores high-frequency details, employing our Adaptive Weighted Discriminator (AWD), which was specifically designed for the HD framework. "SN" and "LN" refer to spectral norm and layer norm, respectively.
  • Figure 3: Performance of Trajectory Distillation (TD) vs. Model Size. The upper bound of TD performance increases with the number of model parameters.
  • Figure 4: Qualitative comparisons with previous methods based on fully fine-tuned SANA.
  • Figure 5: (a) Visualization of the prior (noise) and target distributions. (b) The model architecture used in the toy experiment section.
  • ...and 3 more figures

Theorems & Definitions (8)

  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 1
  • proof
  • Proposition 2
  • proof