Tutorial on Diffusion Models for Imaging and Vision

Stanley H. Chan

TL;DR

This tutorial surveys diffusion-based techniques for imaging and vision, tracing the lineage from variational autoencoders to diffusion probabilistic models and score-based methods. It unifies multiple perspectives by detailing VAE ELBO foundations, DDPM/DDIM frameworks, and score-matching Langevin dynamics, all framed within stochastic differential equations and Fokker–Planck theory. The work clarifies training and inference schemes, noise-prediction paradigms, and numerical solvers, while highlighting practical accelerations and connections to classical inference. It serves as a comprehensive, technical primer for undergraduate and graduate students aiming to research or apply diffusion-based imaging methods.

Abstract

The astonishing growth of generative tools in recent years has empowered many exciting applications in text-to-image and text-to-video generation. The principle underlying these generative tools is diffusion, a sampling mechanism that overcomes shortcomings considered difficult to address in previous approaches. The goal of this tutorial is to discuss the essential ideas underlying diffusion models. The target audience includes undergraduate and graduate students who are interested in doing research on diffusion models or in applying these models to solve other problems.

Paper Structure

This paper contains 31 sections, 31 theorems, 414 equations, 38 figures, 1 table, and 1 algorithm.

Variational Auto-Encoder (VAE)
Building Blocks of VAE
Evidence Lower Bound
Optimization in VAE
Concluding Remark
Denoising Diffusion Probabilistic Model (DDPM)
Building Blocks
Evidence Lower Bound
Distribution of the Reverse Process
Training and Inference
Predicting Noise
Denoising Diffusion Implicit Model (DDIM)
Concluding Remark
Score-Matching Langevin Dynamics (SMLD)
Sampling from a Distribution
...and 16 more sections

Key Result

Theorem 1.1

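Decomposition of Log-Likelihood. The log likelihood $\log p(\mathbf{x})$ can be decomposed as

$$\log p(\mathbf{x}) = \underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x},\mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\right]}_{\text{ELBO}} + \mathbb{D}_{\text{KL}}\big( q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}|\mathbf{x})\big).$$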
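
In words, the decomposition says that the log-likelihood equals the ELBO plus the KL divergence between the proxy $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ and the true posterior $p(\mathbf{z}|\mathbf{x})$, which is the gap described in the Figure 1.4 caption. The sketch below checks this numerically on a one-dimensional Gaussian toy model where every term has a closed form. The toy model, the function names (`log_evidence`, `elbo`, `kl_to_posterior`), and the chosen values of `x`, `m`, `s` are assumptions made for illustration; they are not taken from the tutorial.

```python
import math

# Toy model (an illustrative assumption, not the tutorial's example):
#   prior       p(z)   = N(0, 1)
#   likelihood  p(x|z) = N(z, 1)
# Closed-form consequences:
#   evidence    p(x)   = N(0, 2)
#   posterior   p(z|x) = N(x/2, 1/2)

LOG_2PI = math.log(2.0 * math.pi)


def log_evidence(x):
    """log p(x) for p(x) = N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x * x / 4.0


def elbo(x, m, s):
    """ELBO for the proxy q(z|x) = N(m, s^2), computed in closed form.

    ELBO = E_q[log p(x|z)] + E_q[log p(z)] + H[q]
    """
    exp_log_lik = -0.5 * LOG_2PI - 0.5 * ((x - m) ** 2 + s ** 2)
    exp_log_prior = -0.5 * LOG_2PI - 0.5 * (m ** 2 + s ** 2)
    entropy = 0.5 * (LOG_2PI + 1.0) + math.log(s)
    return exp_log_lik + exp_log_prior + entropy


def kl_to_posterior(x, m, s):
    """KL( N(m, s^2) || p(z|x) ) with p(z|x) = N(x/2, 1/2)."""
    post_mean, post_var = x / 2.0, 0.5
    return (0.5 * math.log(post_var) - math.log(s)
            + (s ** 2 + (m - post_mean) ** 2) / (2.0 * post_var) - 0.5)


# A deliberately imperfect proxy: the gap log p(x) - ELBO should match the KL term.
x, m, s = 1.0, 0.0, 1.0
gap = log_evidence(x) - elbo(x, m, s)
print(gap, kl_to_posterior(x, m, s))
```

The two printed numbers should agree up to floating-point error. Setting `m = x / 2` and `s = math.sqrt(0.5)` makes the proxy equal to the true posterior, in which case the KL term vanishes and the ELBO coincides with the log-likelihood.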

Figures (38)

Figure 1.1: A variational autoencoder consists of an encoder that converts an input $\mathbf{x}$ to a latent variable $\mathbf{z}$, and a decoder that synthesizes an output $\widehat{\mathbf{x}}$ from the latent variable.
Figure 1.2: In the discrete cosine transform (DCT), we can think of the encoder as taking an image $\mathbf{x}$ and generating a latent variable $\mathbf{z}$ by projecting $\mathbf{x}$ onto the basis functions.
Figure 1.3: In a variational autoencoder, the variables $\mathbf{x}$ and $\mathbf{z}$ are connected by the conditional distributions $p(\mathbf{x}|\mathbf{z})$ and $p(\mathbf{z}|\mathbf{x})$. To make things work, we introduce proxy distributions $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$ and $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$.
Figure 1.4: Visualization of $\log p(\mathbf{x})$ and ELBO. The gap between the two is determined by the KL divergence $\mathbb{D}_{\text{KL}}( q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}|\mathbf{x}))$.
Figure 1.5: Implementation of a VAE encoder. We use a neural network to take the image $\mathbf{x}$ and estimate the mean $\boldsymbol{\mu}_{\boldsymbol{\phi}}$ and variance $\sigma^2_{\boldsymbol{\phi}}$ of the Gaussian distribution.
...and 33 more figures

Theorems & Definitions (73)

Definition 1.1
Example 1.1
Definition 1.2
Example 1.2
Example 1.3
Definition 1.3
Theorem 1.1
Example 1.4
Theorem 1.2
Example 1.5
...and 63 more