Tutorial on Diffusion Models for Imaging and Vision

Stanley H. Chan

TL;DR

This tutorial surveys diffusion-based techniques for imaging and vision, tracing the lineage from variational autoencoders to diffusion probabilistic models and score-based methods. It unifies multiple perspectives by detailing VAE ELBO foundations, DDPM/DDIM frameworks, and score-matching Langevin dynamics, all framed within stochastic differential equations and Fokker–Planck theory. The work clarifies training and inference schemes, noise-prediction paradigms, and numerical solvers, while highlighting practical accelerations and connections to classical inference. It serves as a comprehensive, technical primer for undergraduates and graduates aiming to research or apply diffusion-based imaging methods.

Abstract

The astonishing growth of generative tools in recent years has enabled many exciting applications in text-to-image and text-to-video generation. The principle underlying these tools is diffusion, a particular sampling mechanism that overcomes shortcomings that previous approaches found difficult to address. The goal of this tutorial is to discuss the essential ideas underlying diffusion models. The target audience includes undergraduate and graduate students who are interested in doing research on diffusion models or applying these models to solve other problems.

Paper Structure

This paper contains 31 sections, 31 theorems, 414 equations, 38 figures, 1 table, 1 algorithm.

Key Result

Theorem 1.1

Decomposition of Log-Likelihood. The log likelihood $\log p(\mathbf{x})$ can be decomposed as
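A sketch of the decomposition in its standard form, reconstructed from the caption of Figure 1.4 rather than from the theorem statement itself. It assumes the usual definition of the evidence lower bound (ELBO) with the proxy distribution $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$; the label $\text{ELBO}(\mathbf{x})$ is notation chosen here and may differ from the tutorial's.

```latex
\begin{align*}
\log p(\mathbf{x})
  &= \underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}
       \left[ \log \frac{p(\mathbf{x},\mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})} \right]}_{\text{ELBO}(\mathbf{x})}
   + \mathbb{D}_{\text{KL}}\big( q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}|\mathbf{x}) \big)
\end{align*}
```

Because the KL divergence is non-negative, the first term never exceeds $\log p(\mathbf{x})$, and the two are equal exactly when $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ matches the true posterior $p(\mathbf{z}|\mathbf{x})$.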

Figures (38)

  • Figure 1.1: A variational autoencoder consists of an encoder that converts an input $\mathbf{x}$ to a latent variable $\mathbf{z}$, and a decoder that synthesizes an output $\widehat{\mathbf{x}}$ from the latent variable.
  • Figure 1.2: In discrete cosine transform (DCT), we can think of the encoder as taking an image $\mathbf{x}$ and generating a latent variable $\mathbf{z}$ by projecting $\mathbf{x}$ onto the basis functions.
  • Figure 1.3: In a variational autoencoder, the variables $\mathbf{x}$ and $\mathbf{z}$ are connected by the conditional distributions $p(\mathbf{x}|\mathbf{z})$ and $p(\mathbf{z}|\mathbf{x})$. To make things work, we introduce proxy distributions $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$ and $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$.
  • Figure 1.4: Visualization of $\log p(\mathbf{x})$ and ELBO. The gap between the two is determined by the KL divergence $\mathbb{D}_{\text{KL}}( q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}|\mathbf{x}))$.
  • Figure 1.5: Implementation of a VAE encoder. We use a neural network to take the image $\mathbf{x}$ and estimate the mean $\boldsymbol{\mu}_{\boldsymbol{\phi}}$ and variance $\sigma^2_{\boldsymbol{\phi}}$ of the Gaussian distribution.
  • ...and 33 more figures

Theorems & Definitions (73)

  • Definition 1.1
  • Example 1.1
  • Definition 1.2
  • Example 1.2
  • Example 1.3
  • Definition 1.3
  • Theorem 1.1
  • Example 1.4
  • Theorem 1.2
  • Example 1.5
  • ...and 63 more