Scale-wise Distillation of Diffusion Models

Nikita Starodubcev, Denis Kuznedelev, Artem Babenko, Dmitry Baranchuk

TL;DR

SwD introduces a scale-wise diffusion distillation framework that progressively upscales latent resolutions during sampling, enabling next-scale prediction within a single model and reducing compute. The method combines a scale-time scheduling scheme with distribution-matching objectives, notably the novel Patch Distribution Matching (PDM) loss, and integrates with DMD2 using LoRA-equipped teacher and fake models plus a GAN discriminator. Empirically, SwD approaches the inference time of two full-resolution steps while maintaining or exceeding baseline image fidelity on text-to-image diffusion models, as measured by automated metrics and human preference studies. The work shows that operating at lower resolutions at high noise levels is viable and beneficial, and it offers a practical path toward scalable, high-quality diffusion-based generation. The framework is compatible with existing distillation techniques and opens avenues for adaptive scaling and a possible extension to video.

Abstract

We present SwD, a scale-wise distillation framework for diffusion models (DMs), which effectively employs next-scale prediction ideas for diffusion-based few-step generators. Specifically, SwD is inspired by recent insights relating diffusion processes to implicit spectral autoregression. We hypothesize that DMs can initiate generation at lower data resolutions and gradually upscale the samples at each denoising step without loss in performance while significantly reducing computational costs. SwD naturally integrates this idea into existing diffusion distillation methods based on distribution matching. We also enrich the family of distribution matching approaches by introducing a novel patch loss that enforces finer-grained similarity to the target distribution. When applied to state-of-the-art text-to-image diffusion models, SwD approaches the inference time of two full-resolution steps and significantly outperforms its counterparts under the same computation budget, as evidenced by automated metrics and human preference studies.

Paper Structure

This paper contains 20 sections, 3 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Spectral analysis of SDXL (Left) and SD3.5 (Right) VAE latents ($128{\times}128$) for different diffusion timesteps. Vertical lines mark frequency boundaries for lower resolutions; frequencies to the right are not present at lower scale latents. Noise masks high frequencies, suggesting that latent DMs can operate at lower latent resolutions for high noise levels.
  • Figure 2: SwD training step. i) Sample training images and the pair of scales [$s_{i}$, $s_{i+1}$] from the scale schedule. ii) The images are downscaled to the $s_{i}$ and $s_{i+1}$ scales. iii) The lower resolution version is upscaled and noised according to the forward diffusion process at the timestep $t_{i}$. iv) Given the noised images, the model $G$ predicts clean images at the target scale $s_{i+1}$. v) Distribution matching loss is calculated between predicted and target images.
  • Figure 3: SwD sampling. Starting from noise at the low scale $s_{1}$, the model gradually increases resolution via multistep stochastic sampling. At each step, the previous prediction at the scale $s_{i-1}$ is upscaled and noised according to the timestep schedule, $t_{i}$. Then, the generator predicts a clean image at the current scale $s_{i}$.
  • Figure 4: SD3.5 generates cropped images at low resolutions ($256{\times}256$), while SDXL does not produce meaningful images at all. SwD is able to perform successful distillation in such cases and corrects these limitations.
  • Figure 5: Side-by-side comparison between scale-wise and full-scale settings. The numbers indicate the sampling steps.
  • ...and 11 more figures
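
The training step (Figure 2) and the sampling loop (Figure 3) can be sketched in a few lines. The following is an illustrative, dependency-free sketch rather than the authors' implementation: nested lists stand in for latent tensors, `toy_generator` is a hypothetical placeholder for the distilled model $G$, the scale and timestep schedules are made-up values, nearest-neighbour upscaling and average pooling are assumed resampling choices, and the forward process is assumed to be a flow-matching style interpolation $x_t = (1 - t)\,x + t\,\epsilon$. The distribution matching loss itself is not shown.

```python
import random


def upscale(img, size):
    """Nearest-neighbour upscaling of a square 2D list to size x size."""
    n = len(img)
    return [[img[r * n // size][c * n // size] for c in range(size)]
            for r in range(size)]


def downscale(img, size):
    """Average-pool a square 2D list to size x size (size must divide len(img))."""
    k = len(img) // size
    return [[sum(img[r * k + i][c * k + j] for i in range(k) for j in range(k)) / (k * k)
             for c in range(size)]
            for r in range(size)]


def add_noise(img, t, rng):
    """Assumed forward process: x_t = (1 - t) * x + t * eps, with eps ~ N(0, 1)."""
    return [[(1.0 - t) * v + t * rng.gauss(0.0, 1.0) for v in row] for row in img]


def toy_generator(x_t, t):
    """Hypothetical stand-in for G: returns a copy of its input, same resolution."""
    return [row[:] for row in x_t]


def make_training_pair(image, s_lo, s_hi, t, rng):
    """Figure 2, steps ii-iii: build the noised model input and the clean target."""
    target = downscale(image, s_hi)            # clean image at scale s_{i+1}
    low = downscale(image, s_lo)               # clean image at scale s_i
    model_input = add_noise(upscale(low, s_hi), t, rng)
    return model_input, target


def swd_sample(generator, scales, timesteps, rng):
    """Figure 3: start from noise at the lowest scale, then upscale, noise, predict."""
    s1 = scales[0]
    x = [[rng.gauss(0.0, 1.0) for _ in range(s1)] for _ in range(s1)]
    pred = generator(x, timesteps[0])
    for s, t in zip(scales[1:], timesteps[1:]):
        x = add_noise(upscale(pred, s), t, rng)  # previous prediction -> scale s, noised at t
        pred = generator(x, t)                   # clean prediction at the current scale
    return pred


if __name__ == "__main__":
    rng = random.Random(0)
    sample = swd_sample(toy_generator, scales=[4, 8, 16], timesteps=[1.0, 0.6, 0.3], rng=rng)
    print(len(sample), len(sample[0]))
```

In this sketch only the final step runs at the full resolution; earlier steps process 4x4 and 8x8 inputs, which is where the compute saving described in the abstract would come from.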