BudgetFusion: Perceptually-Guided Adaptive Diffusion Models

Qinchan Li, Kenneth Chen, Changyue Su, Qi Sun

TL;DR

BudgetFusion addresses the high energy cost of diffusion-based text-to-image generation by introducing perceptually guided adaptive diffusion. It builds three time-series perceptual metrics (L-SNR, D-SIM, I-CLIP) as a function of diffusion steps and uses an LSTM-based predictor to estimate these curves from a given prompt; plateau-based detection then selects the optimal number of denoising steps $t^{*}$ before generation. Across a large synthetic dataset and user studies, BudgetFusion achieves substantial time savings (up to ~63% reduction in per-image time) without perceptual quality degradation, improving perceptual gains per diffusion step by approximately 6–9% across metrics. The work demonstrates a practical, human-centered approach to reducing compute and energy costs in generative diffusion models, with implications for on-device deployment and greener AI systems.

Abstract

Diffusion models have shown unprecedented success in the task of text-to-image generation. While these models are capable of generating high-quality and realistic images, the complexity of sequential denoising has raised societal concerns regarding high computational demands and energy consumption. In response, various efforts have been made to improve inference efficiency. However, most of the existing efforts have taken a fixed approach with neural network simplification or text prompt optimization. Are the quality improvements from all denoising computations equally perceivable to humans? We observed that images from different text prompts may require different computational efforts given the desired content. The observation motivates us to present BudgetFusion, a novel model that suggests the most perceptually efficient number of diffusion steps before a diffusion model starts to generate an image. This is achieved by predicting multi-level perceptual metrics relative to diffusion steps. With the popular Stable Diffusion as an example, we conduct both numerical analyses and user studies. Our experiments show that BudgetFusion saves up to five seconds per prompt without compromising perceptual similarity. We hope this work can initiate efforts toward answering a core question: how much do humans perceptually gain from images created by a generative model, per watt of energy?

Paper Structure

This paper contains 31 sections, 9 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Given an input text prompt, our BudgetFusion model guides the number of denoising steps $t$ before the generation starts. It balances the trade-off between visual perception quality and computational cost, achieving an optimized "quality gain per denoising step" efficiency.
  • Figure 2: Example generated images with different visual complexity. With current diffusion models, the two images are generated with the same number of denoising steps and the same computational cost. Intuitively, however, the simpler image could have been generated with less computation than the more complex one. This insight motivates us to develop BudgetFusion, an efficiency-optimized guidance for balancing the trade-off between quality and computation. BudgetFusion tailors the denoising process to the given prompt before generation starts.
  • Figure 3: Our BudgetFusion pipeline. Pipeline panel: given an input prompt, we predict three time series of perceptual quality metrics for the generated images at different timesteps (the three colored curves), each representing a different perceptual scale. The model determines the optimal timestep, $t^*$, as the maximum of the plateau points of the three metrics (see the paper's section on perceptual metrics). The pre-trained diffusion model then performs $t^*$ denoising steps rather than continuing the forward process, which our model predicts would yield little further improvement in image quality. Example panel: an additional example of the forward process and the selection made by our model.
  • Figure 4: Model architecture. We visualize our architecture as an unrolled LSTM, $\theta_m$, which takes as input a positionally embedded timestep, $\overline{t}_i$, and a CLIP-embedded prompt, $\overline{p}$. The outputs for each timestep are fed to fully connected layers and normalized with a sigmoid activation to produce scores, $m$, for one of the three perceptual metrics (see the paper's section on perceptual metrics).
  • Figure 5: Example results of predicting denoising steps before generation. Treating each predicted perceptual metric as a time series over timesteps, we predict a "plateau point" for each metric. The maximum of these plateau points (the paper's optimization equation) suggests the most efficient number of timesteps for the diffusion model before generation starts.
  • ...and 8 more figures
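
The captions for Figures 3 and 5 describe choosing $t^*$ as the maximum of the per-metric plateau points. The sketch below illustrates that selection step only. The plateau rule used here (the earliest step from which the score stays within a tolerance `tol` of its final value) is an assumption for illustration, not the paper's definition, and the function names `plateau_point` and `select_steps`, the step grid, and the example curves are all hypothetical. Scores are assumed to lie in [0, 1], in line with the sigmoid-normalized outputs described for Figure 4.

```python
from typing import Dict, Sequence


def plateau_point(curve: Sequence[float], steps: Sequence[int], tol: float = 0.01) -> int:
    """Return the earliest step from which the score stays within `tol` of its final value.

    This plateau rule is an illustrative assumption, not the paper's definition.
    """
    if not curve or len(curve) != len(steps):
        raise ValueError("curve and steps must be non-empty and of equal length")
    final = curve[-1]
    idx = len(curve) - 1
    # Walk backwards while the preceding score is still within `tol` of the final score.
    while idx > 0 and abs(final - curve[idx - 1]) <= tol:
        idx -= 1
    return steps[idx]


def select_steps(predicted: Dict[str, Sequence[float]], steps: Sequence[int], tol: float = 0.01) -> int:
    """Pick t* as the maximum plateau point across all predicted metric curves."""
    return max(plateau_point(curve, steps, tol) for curve in predicted.values())


# Hypothetical predicted curves for one prompt (made-up values, not results from the paper).
steps = [10, 20, 30, 40, 50]
predicted = {
    "L-SNR": [0.40, 0.70, 0.88, 0.895, 0.90],
    "D-SIM": [0.50, 0.78, 0.81, 0.81, 0.81],
    "I-CLIP": [0.60, 0.795, 0.80, 0.80, 0.80],
}

t_star = select_steps(predicted, steps)
print(f"suggested denoising steps: {t_star}")
```

Taking the maximum rather than the mean or minimum means generation stops only once every metric has levelled off, so the slowest-converging perceptual scale determines the step budget.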