Tutorial on Diffusion Models for Imaging and Vision
Stanley H. Chan
TL;DR
This tutorial surveys diffusion-based techniques for imaging and vision, tracing the lineage from variational autoencoders (VAEs) to diffusion probabilistic models and score-based methods. It connects these perspectives by covering the VAE evidence lower bound (ELBO), the DDPM and DDIM frameworks, and score-matching Langevin dynamics, and it relates them through stochastic differential equations and Fokker–Planck theory. It explains training and inference schemes, the noise-prediction formulation, and numerical solvers, and it notes practical accelerations and connections to classical inference. It is intended as a technical primer for undergraduate and graduate students who want to research or apply diffusion-based imaging methods.
Abstract
The astonishing growth of generative tools in recent years has enabled many exciting applications in text-to-image and text-to-video generation. The principle underlying these tools is diffusion, a sampling mechanism that overcomes some shortcomings that previous approaches found difficult to address. The goal of this tutorial is to discuss the essential ideas underlying diffusion models. The target audience is undergraduate and graduate students who are interested in doing research on diffusion models or in applying these models to other problems.
