Training-free Diffusion Acceleration with Bottleneck Sampling

Ye Tian, Xin Xia, Yuxi Ren, Shanchuan Lin, Xing Wang, Xuefeng Xiao, Yunhai Tong, Ling Yang, Bin Cui

TL;DR

Diffusion models are computationally expensive at high resolutions due to self-attention's quadratic cost. This work introduces Bottleneck Sampling, a training-free framework that leverages low-resolution priors through a high-low-high denoising workflow, complemented by resolution-change noise reintroduction and scheduler re-shifting to maintain fidelity. The approach, applied to both image and video diffusion transformers, achieves up to $3\times$ speedup for images and $2.5\times$ for videos while preserving output quality across established metrics and human evaluation. The method requires no architectural changes or retraining, making it a practical, plug-and-play acceleration strategy with broad impact for deploying diffusion models in resource-constrained environments.

Abstract

Diffusion models have demonstrated remarkable capabilities in visual content generation but remain challenging to deploy due to their high computational cost during inference. This computational burden primarily arises from the quadratic complexity of self-attention with respect to image or video resolution. While existing acceleration methods often compromise output quality or necessitate costly retraining, we observe that most diffusion models are pre-trained at lower resolutions, presenting an opportunity to exploit these low-resolution priors for more efficient inference without degrading performance. In this work, we introduce Bottleneck Sampling, a training-free framework that leverages low-resolution priors to reduce computational overhead while preserving output fidelity. Bottleneck Sampling follows a high-low-high denoising workflow: it performs high-resolution denoising in the initial and final stages while operating at lower resolutions in intermediate steps. To mitigate aliasing and blurring artifacts, we further refine the resolution transition points and adaptively shift the denoising timesteps at each stage. We evaluate Bottleneck Sampling on both image and video generation tasks, where extensive experiments demonstrate that it accelerates inference by up to 3$\times$ for image generation and 2.5$\times$ for video generation, all while maintaining output quality comparable to the standard full-resolution sampling process across multiple evaluation metrics.
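To make the cost argument concrete, the sketch below estimates the relative self-attention cost of a high-low-high schedule against full-resolution sampling. It is a back-of-the-envelope illustration only: the patch size, resolutions, and per-stage step counts are assumed values rather than the paper's settings, and it counts only the quadratic attention term while ignoring linear layers and resolution-change overhead.

```python
def attention_cost(height, width, patch=16):
    """Relative self-attention cost for one denoising step.

    Cost is modeled as (number of tokens)^2. `patch` is a hypothetical
    pixels-per-token factor, not a value taken from the paper.
    """
    tokens = (height // patch) * (width // patch)
    return tokens ** 2


def schedule_cost(stages):
    """Total cost of a list of (height, width, num_steps) stages."""
    return sum(steps * attention_cost(h, w) for h, w, steps in stages)


# Illustrative schedules (assumed step counts, same total of 28 steps).
baseline = [(1024, 1024, 28)]
bottleneck = [
    (1024, 1024, 6),   # high resolution: establish semantics
    (512, 512, 16),    # low resolution: cheap intermediate steps
    (1024, 1024, 6),   # high resolution: restore fine detail
]

speedup = schedule_cost(baseline) / schedule_cost(bottleneck)
print(f"attention-only speedup estimate: {speedup:.2f}x")
```

Halving each spatial dimension cuts the token count by 4 and the attention cost by 16, so under these assumptions the 16 low-resolution steps cost about as much as a single full-resolution step.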

Paper Structure

This paper contains 36 sections, 12 equations, 11 figures, 7 tables, 1 algorithm.

Figures (11)

  • Figure 1: Comparison of sampling strategies in our framework. (i) Standard Sampling. (ii) Our Bottleneck Sampling: a high-low-high workflow that captures semantics early, improves efficiency in the middle, and restores details at the end. Images generated by FLUX.1-dev using the prompt: "Design a stylish dancer's back logo with the letters 'R' and 'Y'".
  • Figure 2: Main Results of our Bottleneck Sampling on both text-to-image generation and text-to-video generation. Bottleneck Sampling maintains comparable performance with a 2.5-3$\times$ acceleration ratio in a training-free manner.
  • Figure 3: Overall pipeline of our Bottleneck Sampling. The process consists of three stages: (i) High-Resolution Denoising to preserve semantic information, (ii) Low-Resolution Denoising to improve efficiency, and (iii) High-Resolution Denoising to restore fine details. Images generated by FLUX.1-dev using the prompt: "2D cartoon,Diagonal composition, Medium close-up, a whole body of a classical doll being held by a hand, the doll of a young boy with white hair dressed in purple, He has pale skin and white eyes.".
  • Figure 4: Timestep Shifting Visualization at different shifting-factor settings. Higher shifting scales lead to denoising in higher-noise regions.
  • Figure 5: Qualitative comparison of our Bottleneck Sampling with FLUX.1-dev. Our method achieves up to 3$\times$ speedup while maintaining or improving visual fidelity. Incorrect text rendering and anatomical inconsistencies are highlighted with different colors. Full prompts are provided in the appendix.
  • ...and 6 more figures
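
The behavior described in the Figure 4 caption, where a larger shifting factor moves denoising toward higher-noise regions, can be illustrated with a small sketch. It assumes the shift mapping commonly used with flow-matching schedulers, $\sigma' = s\sigma / (1 + (s - 1)\sigma)$; the listing above does not state the paper's exact formula, so treat this mapping, the linear base schedule, and the function names as assumptions.

```python
def shift_sigma(sigma, shift):
    """Map a noise level in [0, 1] with shifting factor `shift`.

    Assumed flow-matching-style shift; shift=1.0 is the identity, and
    larger values push intermediate noise levels toward 1 (more noise).
    """
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_schedule(num_steps, shift):
    """Linearly spaced noise levels from 1 to 0, then shifted."""
    sigmas = [1.0 - i / num_steps for i in range(num_steps + 1)]
    return [shift_sigma(s, shift) for s in sigmas]


for s in (1.0, 3.0, 6.0):
    levels = ", ".join(f"{v:.2f}" for v in shifted_schedule(4, s))
    print(f"shift={s}: {levels}")
```

The endpoints stay fixed at 1 and 0 for any shifting factor, while the interior noise levels rise as the factor grows, which matches the trend the caption describes.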