Efficient Diffusion Models for Vision: A Survey

Anwaar Ulhaq, Naveed Akhtar

TL;DR

This survey targets the computational efficiency of diffusion models for vision, arguing that while DDPM-based approaches yield high-quality samples, their training and inference costs hinder widespread adoption. It organizes the literature into Efficient Design Strategies and Efficient Process Strategies, detailing architectural and methodological innovations that accelerate diffusion, including latent space diffusion (LDM), multi-scale pyramidal designs (Frido), and diffusion-guided conditioning. The article provides a structured comparison of quality and efficiency metrics, highlights the dominance of diffusion methods in image synthesis on benchmarks like ImageNet while acknowledging resource demands, and discusses future directions to democratize access. Overall, the work offers a pragmatic roadmap for designing practical, scalable diffusion models without sacrificing performance, guiding researchers toward efficiency-aware innovations and standardized benchmarks.

Abstract

Diffusion Models (DMs) have demonstrated state-of-the-art performance in content generation without requiring adversarial training. These models are trained using a two-step process. First, a forward (diffusion) process gradually adds noise to a datum (usually an image). Then, a backward (reverse diffusion) process gradually removes the noise to turn it into a sample of the target distribution being modelled. DMs are inspired by non-equilibrium thermodynamics and have inherently high computational complexity. Due to the frequent function evaluations and gradient calculations in high-dimensional spaces, these models incur considerable computational overhead during both training and inference. This can not only preclude the democratization of diffusion-based modelling, but also hinder the adoption of diffusion models in real-life applications. Moreover, the efficiency of computational models is fast becoming a significant concern due to excessive energy consumption and its environmental impact. These factors have led to multiple contributions in the literature that focus on devising computationally efficient DMs. In this review, we present the most recent advances in diffusion models for vision, focusing specifically on the design aspects that affect the computational efficiency of DMs. In particular, we emphasize the recently proposed design choices that have led to more efficient DMs. Unlike other recent reviews, which discuss diffusion models from a broad perspective, this survey aims to push this research direction forward by highlighting the design strategies in the literature that yield practicable models for the broader research community. We also provide a future outlook on diffusion models in vision from the viewpoint of computational efficiency.
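
As a rough illustration of the forward (noising) process mentioned in the abstract, the following minimal NumPy sketch samples a noised datum in closed form. It assumes the DDPM-style Gaussian forward process with a linear variance schedule; the schedule endpoints, the number of steps, the function names, and the random placeholder "image" are illustrative assumptions rather than details taken from this survey. The reverse process requires a trained denoising network and is not shown.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    # Assumed linear noise-variance schedule (values chosen for illustration).
    return np.linspace(beta_start, beta_end, num_steps)

def forward_diffuse(x0, t, alpha_bar, rng):
    # Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I).
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

num_steps = 1000                       # assumed number of diffusion steps T
betas = linear_beta_schedule(num_steps)
alpha_bar = np.cumprod(1.0 - betas)    # cumulative product of (1 - beta_s) up to step t

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))   # placeholder standing in for an image

x_early, _ = forward_diffuse(x0, 10, alpha_bar, rng)             # still close to x0
x_late, _ = forward_diffuse(x0, num_steps - 1, alpha_bar, rng)   # nearly pure noise

corr_early = np.corrcoef(x0.ravel(), x_early.ravel())[0, 1]
corr_late = np.corrcoef(x0.ravel(), x_late.ravel())[0, 1]
```

Under these assumptions, `corr_early` stays close to 1 while `corr_late` is close to 0, which reflects the gradual corruption of the datum into noise. The cost concern raised in the abstract comes from the reverse direction, where a network is typically evaluated once per step.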

Paper Structure (31 sections, 13 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: (a) Timeline of notable developments (non-exhaustive) in diffusion modelling. (b) The per-month and cumulative number of papers on diffusion models over the last 12 months, based on a Google Scholar search. (c) The proportion of research papers by main application area for diffusion models. The applications include Image Denoising (ID), Image Generation (IG), Time Series (TS), Semantic Segmentation (SS), Image Super-resolution (IS), BIG-bench Machine learning (BM), Image Inpainting (II), Decision Making (DM), and Image-to-image Translation (IT).
  • Figure 2: State-of-the-art diffusion models are able to generate excellent quality samples for different tasks with minimal effort on the user's part. This portends large-scale use of these models in the future, in applications ranging from research to entertainment. The shown images are cropped from the original works.
  • Figure 3: The directed graphical model illustrates processes involved in a diffusion model. The original sample $\boldsymbol{x}_0$ gets gradually corrupted with a Markov process to look like noise $\boldsymbol{x}_T$. The model learns to denoise the corrupted image at every step by learning the conditional probability $p_{\theta} (\boldsymbol{x}_{t-1}| \boldsymbol{x}_t)$. Image taken from DDPM.
  • Figure 4: The architecture of the latent diffusion model (LDM), which is considered a revolutionary work. It has been employed in Stable Diffusion and has turned the direction of research towards efficient diffusion models in general. (Source: LDM)
  • Figure 5: The architecture of the Feature Pyramid Diffusion Model (Frido), which encodes an image into multi-scale feature maps $\mathbf z$ to improve the efficiency of diffusion models. (Source: Frido)
  • ...and 2 more figures