Cross-Resolution Distribution Matching for Diffusion Distillation

Feiyang Chen, Hongpeng Pan, Haonan Xu, Xinyu Duan, Yang Yang, Zhefeng Wang

TL;DR

We propose Cross-Resolution Distribution Matching Distillation (RMD), a distillation framework that bridges cross-resolution distribution gaps to enable few-step, multi-resolution cascaded inference, accelerating generation across various backbones while preserving high fidelity.

Abstract

Diffusion distillation is central to accelerating image and video generation, yet existing methods are fundamentally limited by the denoising process, where step reduction has largely saturated. Partial timestep low-resolution generation can further accelerate inference, but it suffers noticeable quality degradation due to cross-resolution distribution gaps. We propose Cross-Resolution Distribution Matching Distillation (RMD), a novel distillation framework that bridges cross-resolution distribution gaps for high-fidelity, few-step multi-resolution cascaded inference. Specifically, RMD divides the timestep intervals for each resolution using logarithmic signal-to-noise ratio (logSNR) curves, and introduces logSNR-based mapping to compensate for resolution-induced shifts. Distribution matching is conducted along resolution trajectories to reduce the gap between low-resolution generator distributions and the teacher's high-resolution distribution. In addition, a predicted-noise re-injection mechanism is incorporated during upsampling to stabilize training and improve synthesis quality. Quantitative and qualitative results show that RMD preserves high-fidelity generation while accelerating inference across various backbones. Notably, RMD achieves up to 33.4X speedup on SDXL and 25.6X on Wan2.1-14B, while preserving high visual fidelity.
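
The logSNR-based timestep mapping mentioned in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration, not the paper's actual formulation: the rectified-flow schedule x_t = (1 - t) * x0 + t * eps, the constant shift of 2 * log(scale) between resolutions, and the function names `logsnr`, `timestep_from_logsnr`, and `map_timestep` are all hypothetical. The sketch only shows the general idea of finding the timestep at one resolution whose logSNR matches a shifted logSNR at another.

```python
import math


def logsnr(t: float) -> float:
    """logSNR of an ASSUMED rectified-flow schedule x_t = (1 - t) * x0 + t * eps.

    Signal scale is (1 - t), noise scale is t, so SNR = ((1 - t) / t) ** 2.
    """
    return 2.0 * math.log((1.0 - t) / t)


def timestep_from_logsnr(lam: float) -> float:
    """Inverse of logsnr: the t in (0, 1) whose logSNR equals lam."""
    return 1.0 / (1.0 + math.exp(lam / 2.0))


def map_timestep(t: float, shift: float) -> float:
    """Return t' such that logsnr(t') == logsnr(t) + shift."""
    return timestep_from_logsnr(logsnr(t) + shift)


# HYPOTHETICAL shift for a 2x change in side length; the paper's own
# mapping may use a different (for example, empirically fitted) offset.
scale = 2.0
shift = 2.0 * math.log(scale)

t_source = 0.5
t_mapped = map_timestep(t_source, shift)
print(f"t = {t_source:.3f} maps to t' = {t_mapped:.3f}")
```

Under these assumptions a timestep of 0.5 (logSNR of zero) maps to roughly 0.333, since the shifted logSNR corresponds to a cleaner sample. A real implementation would replace the analytic schedule and the constant shift with the logSNR curves measured for each resolution.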

Paper Structure (25 sections, 11 equations, 12 figures, 6 tables, 2 algorithms)

Figures (12)

  • Figure 1: Under the same text prompt and random seed, the SDXL model produces distinct images at 512×512 and 1024×1024 resolutions, revealing a clear resolution-dependent distribution shift. The corresponding prompts are provided in the appendix.
  • Figure 2: Overview of the RMD Framework, which compresses the denoising trajectory of a pretrained diffusion model into a multi-resolution, few-step cascaded denoising process.
  • Figure 3: Illustration of Cross-Resolution Timestep Interval Alignment. logSNR curves at different resolutions show resolution-dependent variations in noising dynamics.
  • Figure 4: The overall pipeline for distilling a pretrained diffusion model into a cascaded generator $G_\theta$ that performs generation across two resolution distribution spaces.
  • Figure 5: Visual comparison of our model with DMD2, TDM, and the base model. All distilled models are evaluated with 6 sampling steps, while the teacher model uses 50 steps with classifier-free guidance. For fair comparison, all results are generated using identical noise seeds and text prompts. Left prompt: A basketball player soaring through the air for a thunderous slam dunk. Right prompt: A frog jumping into a bowl of water, splashing droplets onto the surrounding grass.
  • ...and 7 more figures