DuoDiff: Accelerating Diffusion Models with a Dual-Backbone Approach
Daniel Gallo Fernández, Răzvan-Andrei Matişan, Alejandro Monroy Muñoz, Ana-Maria Vasilcoiu, Janusz Partyka, Tin Hadži Veljković, Metod Jazbec
TL;DR
Diffusion models suffer from slow inference because sampling requires many iterative denoising steps. The authors study adaptive early-exit diffusion models and find a phase transition in the sampling process: the denoising network consistently exits early during the initial sampling steps, then abruptly switches to using the full backbone. Based on this, they propose DuoDiff, a static dual-backbone diffusion framework that uses a shallow backbone for the initial phase and a deep backbone for the remainder, both trained with the same objective, with a fixed transition at $t_s$. Empirically, DuoDiff outperforms AdaDiff in both inference speed and image quality on CIFAR-10, CelebA, and ImageNet. It remains compatible with latent-space diffusion and DDIM, offering a practical, batch-friendly acceleration strategy.
Abstract
Diffusion models have achieved unprecedented performance in image generation, yet they suffer from slow inference due to their iterative sampling process. To address this, early-exiting has recently been proposed, where the depth of the denoising network is made adaptive based on the (estimated) difficulty of each sampling step. Here, we discover an interesting "phase transition" in the sampling process of current adaptive diffusion models: the denoising network consistently exits early during the initial sampling steps, until it suddenly switches to utilizing the full network. Based on this, we propose accelerating generation by employing a shallower denoising network in the initial sampling steps and a deeper network in the later steps. We demonstrate empirically that our dual-backbone approach, DuoDiff, outperforms existing early-exit diffusion methods in both inference speed and generation quality. Importantly, DuoDiff is easy to implement and complementary to existing approaches for accelerating diffusion.
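The dual-backbone idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `dual_backbone_sample`, the toy denoisers, and the `switch_step` parameter (standing in for the fixed transition point, counted here in sampling-step order) are all assumptions made for illustration. The point it shows is that the choice of backbone depends only on the step index, so every sample in a batch follows the same schedule.

```python
from typing import Callable, List, Tuple

# Hypothetical denoiser signature: takes the current sample and the
# sampling-step index, returns the updated sample.
Denoiser = Callable[[List[float], int], List[float]]


def dual_backbone_sample(
    x: List[float],
    num_steps: int,
    switch_step: int,
    shallow: Denoiser,
    deep: Denoiser,
) -> Tuple[List[float], List[str]]:
    """Run num_steps denoising steps, using the shallow backbone for the
    first switch_step steps and the deep backbone for the rest."""
    schedule = []
    for step in range(num_steps):
        # Static rule: depends only on the step index, not on the input,
        # so no per-sample exit decision is needed.
        use_shallow = step < switch_step
        net = shallow if use_shallow else deep
        x = net(x, step)
        schedule.append("shallow" if use_shallow else "deep")
    return x, schedule


# Toy stand-ins for the two networks (assumptions, not real denoisers).
def toy_shallow(x: List[float], step: int) -> List[float]:
    return [0.9 * v for v in x]


def toy_deep(x: List[float], step: int) -> List[float]:
    return [0.5 * v for v in x]


final, schedule = dual_backbone_sample([1.0, -2.0], 5, 3, toy_shallow, toy_deep)
print(schedule)
```

In a real sampler the two callables would be trained denoising networks of different depth and each step would apply the usual update rule of the chosen sampler (for example DDIM); only the backbone-selection line would stay the same.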
