Accelerating Diffusion Decoders via Multi-Scale Sampling and One-Step Distillation

Chuhan Wang, Hao Chen

Abstract

Image tokenization plays a central role in modern generative modeling by mapping visual inputs into compact representations that serve as an intermediate signal between pixels and generative models. Diffusion-based decoders have recently been adopted in image tokenization to reconstruct images from latent representations with high perceptual fidelity. In contrast to diffusion models used for downstream generation, these decoders are dedicated to faithful reconstruction rather than content generation. However, their iterative sampling process introduces significant latency, making them impractical for real-time or large-scale applications. In this work, we introduce a two-stage acceleration framework to address this inefficiency. First, we propose a multi-scale sampling strategy, where decoding begins at a coarse resolution and progressively refines the output by doubling the resolution at each stage, achieving a theoretical speedup of $\mathcal{O}(\log n)$ compared to standard full-resolution sampling. Second, we distill the diffusion decoder at each scale into a single-step denoising model, enabling fast and high-quality reconstructions in a single forward pass per scale. Together, these techniques yield an order-of-magnitude reduction in decoding time with little degradation in output quality. Our approach provides a practical pathway toward efficient yet expressive image tokenizers. We hope it serves as a foundation for future work in efficient visual tokenization and downstream generation.
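
The coarse-to-fine schedule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The four scales and 30 steps per scale follow the figure captions; the base resolution of 32, the nearest-neighbour upsampling, and the `placeholder_denoise` function are assumptions made only for illustration, and the sketch does not re-inject noise after upsampling, since the text does not say whether the method does.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) array (assumed choice)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def placeholder_denoise(x, latent, step, num_steps):
    """Hypothetical stand-in for a latent-conditioned denoiser.

    It pulls x toward the mean of the latent and reaches it exactly on the
    last step; a real decoder would be a trained network.
    """
    target = np.full_like(x, latent.mean())
    weight = 1.0 / (num_steps - step)
    return x + weight * (target - x)


def multiscale_decode(latent, base_res=32, num_scales=4, steps_per_scale=30, seed=0):
    """Decode from pure noise at the coarsest scale, doubling resolution per stage."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((base_res, base_res, 3))  # pure noise, lowest resolution
    for scale in range(num_scales):
        if scale > 0:
            x = upsample2x(x)  # double the resolution before refining
        for step in range(steps_per_scale):
            x = placeholder_denoise(x, latent, step, steps_per_scale)
    return x


latent = np.ones((16, 16, 4))  # hypothetical latent shape
image = multiscale_decode(latent)                          # multi-step per scale
fast_image = multiscale_decode(latent, steps_per_scale=1)  # one step per scale
print(image.shape, fast_image.shape)
```

Setting `steps_per_scale=1` mirrors the structure of the distilled decoder, which runs one forward pass per scale; in the paper that single step is a separately trained student model rather than the same denoiser called once.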

Paper Structure (24 sections, 17 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Left: Our two-stage framework reconstructs images through coarse-to-fine sampling and single-step denoising at each scale. Right: Comparison of image tokenizers on rFID and log throughput; shading indicates the throughput-to-rFID ratio. Our method (red star) delivers state-of-the-art efficiency while maintaining strong reconstruction fidelity.
  • Figure 2: Overview of our two-stage acceleration framework for diffusion decoding. (a) In Stage 1, the decoder progressively reconstructs the image through multi-scale denoising, starting from pure noise at low resolution and upsampling through four spatial scales to obtain a final reconstruction. (b) In Stage 2, this trained decoder is used as the teacher model to supervise a student decoder that performs single-step denoising at each scale. The student is trained with guidance from the teacher outputs, an auxiliary discriminator, and perceptual and reconstruction losses, all conditioned on the same latent representation encoded from the input image.
  • Figure 3: Representative reconstructions. Top: ground truth; middle: Stage-1 multi-scale model (30 steps/scale); bottom: Stage-2 distilled model (1 step/scale, 4 scales). The distilled decoder preserves visual fidelity while cutting the total denoising steps by $\sim30\times$.
  • Figure 4: Effect of cfg on Stage-1 training after 200 epochs. Left: rFID, SSIM, and PSNR as cfg varies. Right: Reconstruction examples for cfg = 1, 2, 3 (top to bottom). A cfg value around 2 offers the best balance of fidelity and perceptual quality.
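
The step reduction quoted in the Figure 3 caption follows from a simple count of denoiser calls, sketched below. The helper name `total_steps` is hypothetical. The count treats every call as equal, although calls at coarser scales process fewer pixels, so it should not be read as a wall-clock speedup.

```python
def total_steps(steps_per_scale: int, num_scales: int) -> int:
    """Total denoiser calls when every scale uses the same number of steps."""
    return steps_per_scale * num_scales


teacher_steps = total_steps(steps_per_scale=30, num_scales=4)  # Stage-1 model
student_steps = total_steps(steps_per_scale=1, num_scales=4)   # Stage-2 distilled model
reduction = teacher_steps / student_steps

print(teacher_steps, student_steps, reduction)
```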