Energy Scaling Laws for Diffusion Models: Quantifying Compute and Carbon Emissions in Image Generation

Aniketh Iyengar, Jiaqi Han, Boris Ruf, Vincent Grari, Marcin Detyniecki, Stefano Ermon

TL;DR

This paper adapts Kaplan-style scaling laws to predict the inference energy of diffusion models, linking energy $E$ to compute (FLOPs) with hardware-specific modifiers. The authors decompose inference into text encoding, iterative denoising, and decoding, show that denoising dominates compute, and validate near-linear scaling of energy with FLOPs across four diffusion models and three GPUs. The fitted law achieves $R^2>0.9$ within individual architectures and generalizes well across models, which supports pre-deployment energy budgeting, carbon-aware optimization, and standardized energy reporting. In practice, the framework informs energy-conscious choices of precision, step count, resolution, and hardware.

Abstract

The rapidly growing computational demands of diffusion models for image generation have raised significant concerns about energy consumption and environmental impact. While existing approaches to energy optimization focus on architectural improvements or hardware acceleration, there is a lack of principled methods to predict energy consumption across different model configurations and hardware setups. We propose an adaptation of Kaplan scaling laws to predict GPU energy consumption for diffusion models based on computational complexity (FLOPs). Our approach decomposes diffusion model inference into text encoding, iterative denoising, and decoding components, with the hypothesis that denoising operations dominate energy consumption due to their repeated execution across multiple inference steps. We conduct comprehensive experiments across four state-of-the-art diffusion models (Stable Diffusion 2, Stable Diffusion 3.5, Flux, and Qwen) on three GPU architectures (NVIDIA A100, A4000, A6000), spanning various inference configurations including resolution (256x256 to 1024x1024), precision (fp16/fp32), step counts (10-50), and classifier-free guidance settings. Our energy scaling law achieves high predictive accuracy within individual architectures (R-squared > 0.9) and exhibits strong cross-architecture generalization, maintaining high rank correlations across models and enabling reliable energy estimation for unseen model-hardware combinations. These results validate the compute-bound nature of diffusion inference and provide a foundation for sustainable AI deployment planning and carbon footprint estimation.
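As a rough illustration of the kind of FLOP-to-energy power law the abstract describes, the sketch below fits $E \approx a \cdot F^{\alpha}$ by least squares in log-log space for a single GPU and precision, then converts a predicted energy to a carbon estimate. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the synthetic measurements, and the grid carbon intensity value are all hypothetical and are not taken from the paper.

```python
import math


def fit_power_law(flops, energy_joules):
    """Fit E ~ a * F**alpha by ordinary least squares on (log F, log E).

    Returns (a, alpha, r2), where r2 is computed in log space.
    """
    xs = [math.log(f) for f in flops]
    ys = [math.log(e) for e in energy_joules]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx                      # slope = scaling exponent
    log_a = mean_y - alpha * mean_x        # intercept = log of prefactor
    ss_res = sum((y - (log_a + alpha * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    return math.exp(log_a), alpha, r2


def predict_energy(a, alpha, flops):
    """Predicted energy in joules for a given FLOP count."""
    return a * flops ** alpha


def energy_to_co2_grams(energy_joules, grams_co2_per_kwh):
    """Convert joules to grams of CO2 using a grid carbon intensity."""
    kwh = energy_joules / 3.6e6            # 1 kWh = 3.6e6 J
    return kwh * grams_co2_per_kwh


# Synthetic, made-up measurements (NOT data from the paper):
# a near-linear law with a few percent of multiplicative noise.
flops = [1e12, 2e12, 4e12, 8e12, 1.6e13]
noise = [1.02, 0.99, 1.01, 0.98, 1.00]
energy = [5e-11 * f ** 0.98 * k for f, k in zip(flops, noise)]

a, alpha, r2 = fit_power_law(flops, energy)
predicted = predict_energy(a, alpha, 3.2e13)   # extrapolate to a larger run
# 400 g CO2/kWh is a placeholder intensity chosen for illustration only.
co2 = energy_to_co2_grams(predicted, 400.0)
print(f"alpha={alpha:.3f}, R^2={r2:.4f}, E={predicted:.1f} J, CO2={co2:.6f} g")
```

An exponent close to 1 in such a fit is what a compute-bound workload would be expected to show, since energy then grows in proportion to the work done. Extending the regression with indicator variables for GPU type and precision would require a multivariate least-squares solve in place of the closed-form slope used here.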

Paper Structure

This paper contains 31 sections, 5 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Individual model energy scaling validation on NVIDIA A100 GPU. Diagnostic plots showing actual versus predicted energy consumption for (a) Flux, (b) Stable Diffusion 3.5, and (c) Qwen diffusion models. (d) Shows the learned scaling parameters with exponents $\alpha$ approaching the theoretical compute-bound ideal of 1.0. All models exhibit near-linear FLOP-energy relationships, confirming the compute-bound nature of diffusion inference. Note: $\beta_{\text{gpu1}}$=$\beta_{\text{gpu2}}$=0 since only A100 data is used; Qwen $\beta_{\text{dtype}}$=0 due to fp16-only training.
  • Figure 2: Diagnostic plots show actual versus predicted energy consumption for (a) Flux and (b) Stable Diffusion 3.5 models, using data from NVIDIA A100 and A6000 GPUs. Panel (c) presents the learned scaling parameters, highlighting stable exponents ($\alpha$) across GPUs and GPU-specific coefficients. Note: The Flux plot includes no-CFG, float16, and non-50-prompt runs, while the SD 3.5 plot includes only CFG, float16, and non-50-prompt runs.
  • Figure 3: The top row shows training on model pairs: (a) Qwen + SD 3.5, (b) Flux + SD 3.5, and (c) Flux + Qwen. The bottom row presents the corresponding tests on the held-out models: Flux, Qwen, and SD 3.5, respectively. Consistent diagnostic patterns across all training–test pairs demonstrate the robustness of our FLOP-based scaling methodology for cross-model energy prediction. All plots here use results from the NVIDIA A100, which yielded the most comprehensive hyperparameter search.
  • Figure 4: Diagnostic plots show (a) Stable Diffusion 2 on NVIDIA A100, illustrating U-Net scaling behavior, and (b) cross-GPU validation across A100, A4000, and A6000 platforms. Consistent scaling patterns confirm that our FLOP-based energy prediction generalizes beyond transformer-based models to convolutional architectures. Note: Cross-GPU results include CFG, float16, and non-50-prompt runs.
  • Figure 5: Cross-architecture experiments demonstrate generalization between U-Net and MMDiT architectures on the A100. Top row shows training: (a) Qwen+SD2 (MMDiT+U-Net), (b) Flux+Qwen+SD3.5 (all MMDiT), (c) SD3.5+Flux (both MMDiT). Bottom row shows corresponding testing: Flux+SD3.5 (MMDiT), SD2 (U-Net), SD2+Qwen (U-Net+MMDiT). The consistent scaling patterns validate that our FLOP-based methodology captures fundamental energy-complexity relationships independent of specific architectural paradigms, successfully bridging traditional convolutional U-Net designs and modern transformer-based MMDiT approaches.
  • ...and 7 more figures
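
The parameter names in the Figure 1 caption ($\alpha$, $\beta_{\text{gpu1}}$, $\beta_{\text{gpu2}}$, $\beta_{\text{dtype}}$) suggest a log-linear regression with indicator terms for hardware and precision. The fitted equation itself is not reproduced in this excerpt, so the form below is an assumption that is merely consistent with those names; the intercept $\log a$, the FLOP count $F$, and the indicator notation are introduced here for illustration, and which precision the dtype indicator marks is not stated in the caption.

$$
\log E = \log a + \alpha \log F + \beta_{\text{gpu1}}\,\mathbb{1}[\text{GPU} = \text{gpu1}] + \beta_{\text{gpu2}}\,\mathbb{1}[\text{GPU} = \text{gpu2}] + \beta_{\text{dtype}}\,\mathbb{1}[\text{dtype}]
$$

Under this reading, fitting on A100 data alone leaves both GPU indicators at zero for every sample, which matches the caption's note that $\beta_{\text{gpu1}} = \beta_{\text{gpu2}} = 0$. Likewise, a model measured at a single precision gives the regression no variation from which to estimate $\beta_{\text{dtype}}$. With $\alpha = 1$ the relation reduces to $E \propto F$ for a fixed GPU and precision, the compute-bound ideal the caption refers to.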