DuoDiff: Accelerating Diffusion Models with a Dual-Backbone Approach

Daniel Gallo Fernández, Răzvan-Andrei Matişan, Alejandro Monroy Muñoz, Ana-Maria Vasilcoiu, Janusz Partyka, Tin Hadži Veljković, Metod Jazbec

TL;DR

Diffusion models suffer from slow inference because sampling requires many iterative denoising steps. The authors study adaptive early-exit behavior and find a phase transition: the network exits early during the initial sampling steps ($t$ close to $T$) and needs the full backbone for the later ones. They propose DuoDiff, a static dual-backbone diffusion framework that uses a shallow backbone for the initial phase and a deep backbone for the remainder, with both backbones trained on the same objective and a fixed switch point $t_s$. Empirically, DuoDiff outperforms AdaDiff in both inference speed and image quality across CIFAR-10, CelebA, and ImageNet, and remains compatible with latent-space diffusion and DDIM, offering a practical, batch-friendly acceleration strategy.

Abstract

Diffusion models have achieved unprecedented performance in image generation, yet they suffer from slow inference due to their iterative sampling process. To address this, early-exiting has recently been proposed, where the depth of the denoising network is made adaptive based on the (estimated) difficulty of each sampling step. Here, we discover an interesting "phase transition" in the sampling process of current adaptive diffusion models: the denoising network consistently exits early during the initial sampling steps, until it suddenly switches to utilizing the full network. Based on this, we propose accelerating generation by employing a shallower denoising network in the initial sampling steps and a deeper network in the later steps. We demonstrate empirically that our dual-backbone approach, DuoDiff, outperforms existing early-exit diffusion methods in both inference speed and generation quality. Importantly, DuoDiff is easy to implement and complementary to existing approaches for accelerating diffusion.
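The switching rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes sampling runs over $t = T, T-1, \dots, 1$ and that the shallow backbone handles the first $t_s$ of those steps, so that $t_s = 0$ reduces to the base model. The names `use_shallow`, `backbone_schedule`, `sample`, and the `update` callback are hypothetical, and the two denoisers are passed in as plain callables.

```python
from typing import Callable, List, TypeVar

X = TypeVar("X")  # type of the sample being denoised (e.g. a tensor)


def use_shallow(t: int, T: int, t_s: int) -> bool:
    """Return True if sampling step t belongs to the shallow phase.

    Assumption: sampling visits t = T, T-1, ..., 1 and the shallow
    backbone covers the first t_s visited steps (t close to T).
    """
    if not 0 <= t_s <= T:
        raise ValueError("t_s must lie in [0, T]")
    return t > T - t_s


def backbone_schedule(T: int, t_s: int) -> List[str]:
    """List which backbone runs at each step, in sampling order."""
    return ["shallow" if use_shallow(t, T, t_s) else "deep"
            for t in range(T, 0, -1)]


def sample(
    x_T: X,
    T: int,
    t_s: int,
    shallow: Callable[[X, int], X],
    deep: Callable[[X, int], X],
    update: Callable[[X, X, int], X],
) -> X:
    """Generic reverse loop with a static switch between two denoisers.

    `update` stands in for whatever sampler step is used (hypothetical
    placeholder); it maps (x_t, predicted_noise, t) to x_{t-1}.
    """
    x = x_T
    for t in range(T, 0, -1):
        net = shallow if use_shallow(t, T, t_s) else deep
        x = update(x, net(x, t), t)
    return x
```

For example, `backbone_schedule(T=6, t_s=2)` lists two shallow steps followed by four deep steps. Because the switch depends only on $t$ and not on the individual sample, every element of a batch uses the same backbone at each step, which is what makes a static scheme of this kind batch-friendly.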

Paper Structure

This paper contains 17 sections, 8 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Early-exit trends in AdaDiff. The plots show the average exit layer across 5,120 images for different datasets and various exiting thresholds $\theta$. We observe that early-exiting in the denoising network occurs only at the start of the generation process (for $t$ close to $T$), followed by a sudden switch to using the full denoising network for the remaining generation steps. The pattern is consistent across different datasets and resembles a step function.
  • Figure 2: Denoising objective. Given a noisy image and a timestep, the model must predict the added noise. This task is easier for high values of $t$, for which the expected output is very similar to the input.
  • Figure 3: DuoDiff framework. DuoDiff employs a shallow three-layer U-ViT backbone for the first $t_s$ timesteps to reduce computational overhead, before switching to a full backbone for the remaining denoising steps, ensuring both efficiency and image quality. Both backbones are trained on the same dataset using the same diffusion objective.
  • Figure 4: Comparison of AdaDiff and DuoDiff. The plot shows FID score and generation time per sample (lower is better for both) across two datasets (ImageNet $64\times64$ and $256\times256$). Each point represents a different parameter configuration, including the base model, which can be seen as a special case of DuoDiff ($t_s = 0$). DuoDiff consistently outperforms AdaDiff in both image quality (FID) and inference time.
  • Figure 5: Qualitative hyperparameter analysis. Comparison of image generation results for AdaDiff (left) and DuoDiff (right) on the ImageNet dataset ($256 \times 256$) using different values for their respective hyperparameters ($\theta$ in AdaDiff and $t_s$ in DuoDiff). We observe how higher values of $\theta$ and $t_s$ diminish the quality of the generated images.
  • ...and 2 more figures
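
The observation in the Figure 2 caption, that noise prediction is easier at high $t$ because the target resembles the input, can be made concrete with a small calculation. The sketch below assumes the standard DDPM forward process $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with a linear $\beta$ schedule from $10^{-4}$ to $0.02$ over $T = 1000$ steps; the source does not state which schedule the paper uses, so these values and the function names are illustrative assumptions only.

```python
import math


def alpha_bar(t: int, T: int = 1000,
              beta_start: float = 1e-4, beta_end: float = 0.02) -> float:
    """Cumulative product of (1 - beta_s) for s = 1..t under a linear schedule.

    The schedule endpoints are assumed defaults, not taken from the paper.
    """
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (T - 1)
        prod *= 1.0 - beta
    return prod


def signal_weight(t: int) -> float:
    """Coefficient of the clean image x_0 inside x_t."""
    return math.sqrt(alpha_bar(t))


def noise_weight(t: int) -> float:
    """Coefficient of the noise epsilon inside x_t."""
    return math.sqrt(1.0 - alpha_bar(t))


if __name__ == "__main__":
    for t in (10, 500, 1000):
        print(t, round(signal_weight(t), 4), round(noise_weight(t), 4))
```

Under these assumptions the noise coefficient approaches 1 as $t$ approaches $T$, so $x_t$ is almost entirely the noise the network is asked to predict. This is consistent with a shallow network being adequate for the first sampling steps.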