An overview of diffusion models for generative artificial intelligence

Davide Gallon, Arnulf Jentzen, Philippe von Wurstemberger

TL;DR

The article gives a rigorous mathematical treatment of denoising diffusion probabilistic models (DDPMs) for generative AI. It frames the problem with a forward diffusion $X^{\varnothing}$ that progressively adds noise and a learnable backward denoiser $X^{\theta}$ that reconstructs data from noise. It derives the Gaussian forward and backward dynamics, Bayes rules for Gaussian transitions, and a tractable training objective based on the cross-entropy between $\mathfrak{p}^{\varnothing}_{0}$ and $\mathfrak{p}^{\theta}_{0}$, with an upper bound that decomposes into per-step terms to guide learning. The paper then surveys advanced variants (Improved DDPM, DDIM, classifier-free diffusion guidance, and latent diffusion models such as Stable Diffusion), highlighting improvements in fidelity, controllability (including text and class conditioning), and sampling efficiency. It also covers architectures such as UNets with time embeddings, evaluation metrics (e.g., Inception Score and Fréchet Inception Distance), and leading models such as GLIDE, DALL-E 2/3, and Imagen, giving a cohesive roadmap for diffusion-based generative systems across vision and multimodal tasks.

Abstract

This article provides a mathematically rigorous introduction to denoising diffusion probabilistic models (DDPMs), sometimes also referred to as diffusion probabilistic models or diffusion models, for generative artificial intelligence. We provide a detailed basic mathematical framework for DDPMs and explain the main ideas behind training and generation procedures. In this overview article we also review selected extensions and improvements of the basic framework from the literature such as improved DDPMs, denoising diffusion implicit models, classifier-free diffusion guidance models, and latent diffusion models.
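
The forward noising step at the core of a DDPM can be illustrated with a minimal sketch. It assumes the common Gaussian forward process, in which $X_t$ given $X_0$ is Gaussian with mean $\sqrt{\bar{\alpha}_t}\,X_0$ and covariance $(1-\bar{\alpha}_t) I$, where $\bar{\alpha}_t$ is the running product of the per-step coefficients. The linear schedule, its endpoints, and all function names below are illustrative assumptions, not the schedule or notation used in the article.

```python
import math
import random


def linear_alphas(T, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear schedule: alpha_t = 1 - beta_t for t = 1..T.
    The endpoints are assumptions chosen for illustration."""
    if T == 1:
        return [1.0 - beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [1.0 - (beta_start + i * step) for i in range(T)]


def cumulative_alphas(alphas):
    """Running products alpha_1 * ... * alpha_t for t = 1..T."""
    out, prod = [], 1.0
    for a in alphas:
        prod *= a
        out.append(prod)
    return out


def forward_sample(x0, t, alpha_bars, rng):
    """Draw X_t given X_0 = x0 in a single step:
    X_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    abar = alpha_bars[t - 1]  # t is 1-indexed
    return [
        math.sqrt(abar) * xi + math.sqrt(1.0 - abar) * rng.gauss(0.0, 1.0)
        for xi in x0
    ]


T = 1000
alpha_bars = cumulative_alphas(linear_alphas(T))
rng = random.Random(0)   # fixed seed for reproducibility
x0 = [1.0, -2.0, 0.5]    # toy "data point"
x_T = forward_sample(x0, T, alpha_bars, rng)
```

Under this assumed schedule the cumulative product at $t = T$ is close to zero, so `x_T` is almost pure Gaussian noise; this is the starting point from which a backward denoiser would generate samples.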

Paper Structure

This paper contains 43 sections, 20 theorems, 126 equations, 6 figures, 1 table.

Key Result

Lemma 2.5

Assume setting01. Then it holds for all $\theta \in \mathbb{R}^{\mathfrak{d}}$, $t \in \{1,\ldots,T\}$, $x_0,x_1,\ldots,x_T \in \mathbb{R}^{d}$ that

Figures (6)

  • Figure 2.1: Graphical illustration of the forward process $X^{\varnothing}$ and the backward process $(X^{\theta})_{\theta \in \mathbb{R}^\mathfrak{d}}$ with Markov assumptions in \ref{setting01}.
  • Figure 3.1: Graphical illustration of $(\tilde{\alpha}_t)_{t \in \{1,\ldots,T\}}$ in \ref{setting:dnn3} for $T=1000$ and $(\alpha_t)_{t \in \{1,\ldots,T\}}$ given as in \ref{setting:dnn3:alpha}.
  • Figure 3.2: Graphical illustration of a typical UNet architecture for two-dimensional data (e.g., images). Convolutions are shown in yellow, max pooling operations in red, and transpose convolutions in blue. At each max pooling operation in the encoder network (left side), the number of channels doubles and the spatial dimensions are halved. Conversely, at each transpose convolution in the decoder network (right side), the number of channels is halved and the spatial dimensions double. In the decoder, the encoder's feature maps are concatenated with the decoder's feature maps.
  • Figure 3.3: Sinusoidal time embedding for $1000$ time steps with embedding dimension $64$.
  • Figure 4.1: Evaluation of text-conditional image synthesis on the $256\times256$ sized MS-COCO dataset [lin2015microsoft].
  • ...and 1 more figures
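
As a rough illustration of the sinusoidal time embedding mentioned for Figure 3.3, the sketch below builds a table for $1000$ time steps with embedding dimension $64$. It assumes the widely used interleaved sine/cosine convention with base $10000$; the article's exact formula is not shown here, so the function name, the base, and the interleaving are assumptions.

```python
import math


def sinusoidal_embedding(t, dim=64, base=10000.0):
    """Map a time step t to a vector of length dim.
    Assumed convention: entries 2i and 2i+1 hold sin and cos of
    t * base**(-2i/dim), for i = 0..dim/2 - 1."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    emb = []
    for i in range(dim // 2):
        freq = base ** (-2.0 * i / dim)  # frequency decreases with i
        emb.append(math.sin(t * freq))
        emb.append(math.cos(t * freq))
    return emb


# One embedding vector per time step 0..999, each of length 64.
table = [sinusoidal_embedding(t) for t in range(1000)]
```

Each row of `table` gives a time-conditioned network a fixed, smooth encoding of the time step, with low indices oscillating quickly and high indices slowly.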

Theorems & Definitions (48)

  • Remark 2.2: Explanations for \ref{setting0}
  • Remark 2.4: Transition kernels and transition densities in \ref{setting01}
  • Lemma 2.5: Representation for marginal in with Markov assumptions
  • Definition 2.6
  • Definition 2.7: divergence
  • Lemma 2.8: Properties of the and the divergence
  • Lemma 2.9: Upper bounds for objective in
  • Remark 2.10: Explanations for \ref{lemma:upperboundE}
  • Remark 2.12: Explanations for \ref{setting_base}
  • Definition 3.1: Gaussian
  • ...and 38 more