A Simple Early Exiting Framework for Accelerated Sampling in Diffusion Models

Taehong Moon, Moonseok Choi, EungGu Yun, Jongmin Yoon, Gayoung Lee, Jaewoong Cho, Juho Lee

TL;DR

Adaptive Score Estimation (ASE) is a time-varying early-exiting framework that accelerates sampling in diffusion models by skipping portions of the score estimation network, with the amount skipped depending on the time step. Building on the observation that the difficulty of score estimation varies with diffusion time, ASE combines a time-dependent block-dropping schedule with a targeted fine-tuning strategy that uses EMA and time-weighted losses to maintain sample quality. Empirical results on DiT and U-ViT backbones show 25–30% faster sampling with comparable FID across multiple solvers, and the method also extends to large-scale text-to-image generation with modest data and modest acceleration. The approach is architecture-agnostic and solver-compatible, offering a practical path to deploying faster diffusion-based generators in real-world settings. Its main limitation is that the dropping schedules are designed by hand.

Abstract

Diffusion models have shown remarkable performance in generation problems across various domains, including images, videos, text, and audio. A practical bottleneck of diffusion models is their sampling speed, which stems from the repeated evaluation of score estimation networks during inference. In this work, we propose a novel framework that adaptively allocates the compute required for score estimation, thereby reducing the overall sampling time of diffusion models. We observe that the amount of computation required for score estimation may vary with the time step for which the score is estimated. Based on this observation, we propose an early-exiting scheme in which we skip a subset of the parameters in the score estimation network during inference, according to a time-dependent exit schedule. Using diffusion models for image synthesis, we show that our method can significantly improve sampling throughput without compromising image quality. Furthermore, we demonstrate that our method integrates seamlessly with various types of solvers for faster sampling, capitalizing on this compatibility to enhance overall efficiency. The source code and our experiments are available at https://github.com/taehong-moon/ee-diffusion
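
The time-dependent exit schedule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear ramp, the convention that t = 1 is pure noise and t = 0 is data, the choice to skip the trailing blocks, and the names `blocks_to_keep`, `early_exit_forward`, and `relative_cost` are all assumptions made for illustration. Scalar functions stand in for the network's building blocks.

```python
from typing import Callable, Sequence

# Stand-in for one building block of the score network (hypothetical).
Block = Callable[[float], float]


def blocks_to_keep(t: float, num_blocks: int, max_drop: int) -> int:
    """Assumed schedule: drop more blocks as t approaches 1 (noise)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    dropped = int(t * max_drop)  # linear ramp, an illustrative choice only
    return num_blocks - dropped


def early_exit_forward(x: float, t: float, blocks: Sequence[Block],
                       max_drop: int) -> float:
    """Evaluate only the first k blocks, where k depends on the time step."""
    k = blocks_to_keep(t, len(blocks), max_drop)
    for block in blocks[:k]:
        x = block(x)
    return x


def relative_cost(timesteps: Sequence[float], num_blocks: int,
                  max_drop: int) -> float:
    """Fraction of block evaluations used, relative to running every block."""
    kept = sum(blocks_to_keep(t, num_blocks, max_drop) for t in timesteps)
    return kept / (num_blocks * len(timesteps))


if __name__ == "__main__":
    toy_blocks = [lambda v: v + 1.0] * 8  # 8 identical toy blocks
    print(early_exit_forward(0.0, 1.0, toy_blocks, max_drop=4))
    print(relative_cost([0.0, 0.5, 1.0], num_blocks=8, max_drop=4))
```

In this toy setup the compute saved depends entirely on the schedule. The paper's schedules are designed by hand and paired with fine-tuning, which this sketch does not model.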

Paper Structure

This paper contains 37 sections, 9 equations, 14 figures, 10 tables, and 1 algorithm.

Figures (14)

  • Figure 1: Snapshot samples from the noise-easy and data-easy schedules when fine-tuning DiT on ImageNet. While the data-easy schedule struggles to produce a discernible dog image, the noise-easy schedule successfully generates a clear dog image, achieving a converged FID score of 8.88.
  • Figure 2: Schematic of the time-dependent exit schedule. Considering the varying difficulty of score estimation, we drop more building blocks of the architecture near noise. While we skip whole building blocks in DiT, we only partially skip blocks in U-ViT because of its long skip connections.
  • Figure 3: Trade-off between image generation quality and sampling speed on ImageNet with DiT (left) and CelebA with U-ViT (right). We generate samples with the DDIM and DPM samplers using $50$ steps for ImageNet and CelebA, respectively. ASE largely outperforms other techniques, preserving the FID score while boosting sampling speed by approximately 25–30%. Here, SS stands for the scale-shift adjustment used together with block caching.
  • Figure 4: Robustness of ASE across varying numbers of sampling timesteps: ImageNet with the DDIM solver (left) and CelebA with the DPM solver (right). Both experiments employ the U-ViT architecture. ASE performs robustly across different numbers of timesteps in both experimental settings.
  • Figure 5: Comparison between samples produced by pre-trained PixArt-$\alpha$ and ASE-enhanced PixArt-$\alpha$. Text prompts are randomly chosen.
  • ...and 9 more figures