AdaDiff: Adaptive Step Selection for Fast Diffusion Models

Hui Zhang, Zuxuan Wu, Zhen Xing, Jie Shao, Yu-Gang Jiang

TL;DR

AdaDiff addresses slow sampling in diffusion models by learning instance-specific denoising step budgets conditioned on prompt richness. It introduces a lightweight step-selection network, trained with policy gradient to maximize a reward that balances inference time against image/video quality, so the step count is chosen per input. Across image and video benchmarks, AdaDiff reduces inference time by roughly 33–40% at comparable quality, can be combined with other acceleration methods, and can be transferred zero-shot between datasets. The learned policies are intuitive: richer prompts receive more steps and simpler prompts receive fewer.

Abstract

Diffusion models, a type of generative model, have achieved impressive results in generating images and videos conditioned on text. However, their generation process requires dozens of denoising steps to produce photorealistic images and videos, which is computationally expensive. Unlike previous methods that take a "one-size-fits-all" approach to speeding up sampling, we argue that the number of denoising steps should be sample-specific and conditioned on the richness of the input text. To this end, we introduce AdaDiff, a lightweight framework designed to learn instance-specific step usage policies, which are then used by the diffusion model for generation. AdaDiff is optimized using a policy gradient method to maximize a carefully designed reward function that balances inference time and generation quality. We conduct experiments on three image generation and two video generation benchmarks and demonstrate that our approach achieves visual quality similar to that of a baseline using a fixed 50 denoising steps while reducing inference time by at least 33% and by as much as 40%. Furthermore, our method can be used on top of other acceleration methods to provide further speed benefits. Lastly, qualitative analysis shows that AdaDiff allocates more steps to more informative prompts and fewer steps to simpler prompts.
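
The training idea in the abstract (sample a step budget from a policy, score the result, and reinforce budgets that give a good speed/quality trade-off) can be illustrated with a small REINFORCE loop. This is only a toy sketch under loud assumptions, not the authors' implementation: the word-count `richness_bucket`, the synthetic `toy_reward`, the candidate budgets in `STEP_CHOICES`, the weight `TIME_WEIGHT`, and the tabular softmax policy that stands in for the step-selection network are all hypothetical. No diffusion model is run; a real system would score generated images or videos instead.

```python
import math
import random

# Hypothetical candidate step budgets (the abstract's baseline uses a fixed 50 steps).
STEP_CHOICES = [10, 20, 30, 40, 50]
TIME_WEIGHT = 0.5  # assumed trade-off weight between quality and inference cost


def richness_bucket(prompt):
    """Toy proxy for prompt richness: 0 = simple, 1 = medium, 2 = rich (by word count)."""
    n_words = len(prompt.split())
    if n_words <= 5:
        return 0
    if n_words <= 12:
        return 1
    return 2


def toy_reward(bucket, steps):
    """Assumed reward: synthetic quality minus a cost proportional to the step count."""
    steps_needed = (10, 30, 50)[bucket]        # synthetic: richer prompts need more steps
    quality = min(1.0, steps / steps_needed)   # saturates once enough steps are used
    return quality - TIME_WEIGHT * steps / max(STEP_CHOICES)


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(prompts, iterations=9000, lr=0.2, seed=0):
    """REINFORCE on a tabular softmax policy: one row of logits per richness bucket."""
    rng = random.Random(seed)
    logits = [[0.0] * len(STEP_CHOICES) for _ in range(3)]
    baseline = [0.0, 0.0, 0.0]  # running mean reward per bucket, used to reduce variance
    for _ in range(iterations):
        bucket = richness_bucket(rng.choice(prompts))
        probs = softmax(logits[bucket])
        action = rng.choices(range(len(STEP_CHOICES)), weights=probs)[0]
        reward = toy_reward(bucket, STEP_CHOICES[action])
        advantage = reward - baseline[bucket]
        baseline[bucket] += 0.1 * advantage
        # Gradient of log pi(action) with respect to the logits is one_hot(action) - probs.
        for k in range(len(STEP_CHOICES)):
            indicator = 1.0 if k == action else 0.0
            logits[bucket][k] += lr * advantage * (indicator - probs[k])
    return logits


def select_steps(logits, prompt):
    """Greedy inference-time choice: the step budget with the highest logit."""
    row = logits[richness_bucket(prompt)]
    return STEP_CHOICES[row.index(max(row))]


PROMPTS = [
    "a red apple",
    "a cat sitting on a wooden chair near a window",
    "an astronaut riding a horse through a neon city at night "
    "while fireworks explode above a crowded bridge",
]

policy = train_policy(PROMPTS)
for prompt in PROMPTS:
    print(select_steps(policy, prompt), "steps:", prompt)
```

On this synthetic reward the learned policy should tend to pick fewer steps for the short prompt than for the long one, which mirrors the qualitative behavior the abstract reports. The per-bucket baseline is a common variance-reduction choice for policy gradient and is not something the abstract specifies.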

Paper Structure (15 sections, 7 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: A conceptual overview of our approach. AdaDiff assigns instance-specific denoising steps based on prompt richness to minimize inference time with minimal quality loss. Images with red borders are produced by AdaDiff.
  • Figure 2: An overview of AdaDiff. Given the input prompts, the step selection network learns the information richness of each prompt and derives the corresponding step usage policy. These policies determine the number of steps required for the diffusion model to generate images. Subsequently, the reward function balances the trade-off between speed and image quality.
  • Figure 3: Analysis of the learned step policy as a function of the number of words and objects in the prompts. AdaDiff tends to assign more steps to more informative prompts.
  • Figure 4: Qualitative results. AdaDiff performs instance-specific dynamic generation based on prompt complexity.
  • Figure 5: Speed and quality trade-offs as modulated by top-$k$ and the image reward weight.
  • ...and 2 more figures