Foundations of Diffusion Models in General State Spaces: A Self-Contained Introduction

Vincent Pauline, Tobias Höppe, Kirill Neklyudov, Alexander Tong, Stefan Bauer, Andrea Dittadi

TL;DR

This work gives a unified, self-contained treatment of diffusion models on both continuous and discrete state spaces. It starts from a discrete-time forward (noising) process, derives the exact reverse dynamics conditioned on data, and shows that, in the limit of infinitely many steps, these converge to continuous-time formulations: SDEs for continuous spaces and CTMCs for discrete spaces. The authors present a three-step recipe (define the forward corruption, parameterize the reverse process, maximize the ELBO), then recast the theory in a general infinitesimal-generator framework that yields equivalent forward/reverse formulations and training objectives (denoising score matching in the continuous case, denoising score entropy in the discrete case). They also discuss latent-diffusion strategies, practical reverse-process parameterizations, and extensions that bridge continuous and discrete diffusion for discrete data. The generator perspective unifies the standard diffusion literature and provides a principled path to generalizations such as piecewise-deterministic and jump processes. The result is a compact, theory-driven roadmap to modern diffusion methodology for real-valued data, categorical data, and diffusion in learned latent spaces.

Abstract

Although diffusion models now occupy a central place in generative modeling, introductory treatments commonly assume Euclidean data and seldom clarify their connection to discrete-state analogues. This article is a self-contained primer on diffusion over general state spaces, unifying continuous domains and discrete/categorical structures under one lens. We develop the discrete-time view (forward noising via Markov kernels and learned reverse dynamics) alongside its continuous-time limits -- stochastic differential equations (SDEs) in $\mathbb{R}^d$ and continuous-time Markov chains (CTMCs) on finite alphabets -- and derive the associated Fokker--Planck and master equations. A common variational treatment yields the ELBO that underpins standard training losses. We make explicit how forward corruption choices -- Gaussian processes in continuous spaces and structured categorical transition kernels (uniform, masking/absorbing and more) in discrete spaces -- shape reverse dynamics and the ELBO. The presentation is layered for three audiences: newcomers seeking a self-contained intuitive introduction; diffusion practitioners wanting a global theoretical synthesis; and continuous-diffusion experts looking for an analogy-first path into discrete diffusion. The result is a unified roadmap to modern diffusion methodology across continuous domains and discrete sequences, highlighting a compact set of reusable proofs, identities, and core theoretical principles.
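
To make the "structured categorical transition kernels" concrete, here is a minimal sketch of one-step uniform and absorbing (masking) kernels on a small alphabet. The function names, the per-step corruption probability `beta`, and the convention that the mask token is the last index are illustrative assumptions, not notation taken from the paper.

```python
import numpy as np


def uniform_kernel(num_states: int, beta: float) -> np.ndarray:
    """With probability beta resample the token uniformly, otherwise keep it."""
    uniform = np.full((num_states, num_states), 1.0 / num_states)
    return (1.0 - beta) * np.eye(num_states) + beta * uniform


def absorbing_kernel(num_states: int, beta: float) -> np.ndarray:
    """With probability beta jump to the mask token, otherwise keep the token.

    Assumption: the mask token is the last index and is absorbing.
    """
    mask = num_states - 1
    kernel = (1.0 - beta) * np.eye(num_states)
    kernel[:, mask] += beta  # the mask row becomes (0, ..., 0, 1)
    return kernel


K, beta, T = 4, 0.1, 200  # hypothetical alphabet size, noise level, step count
Q_unif = uniform_kernel(K, beta)
Q_abs = absorbing_kernel(K, beta)

x0 = np.eye(K)[0]  # one-hot clean token
# Row vector times the T-step kernel gives the marginal after T noising steps.
p_unif = x0 @ np.linalg.matrix_power(Q_unif, T)
p_abs = x0 @ np.linalg.matrix_power(Q_abs, T)
```

Under these assumptions, `p_unif` approaches the uniform distribution while `p_abs` concentrates on the mask token, which illustrates how the choice of forward kernel fixes the noise distribution that the reverse process starts from.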

Paper Structure

This paper contains 87 sections, 26 theorems, 382 equations, 5 figures, 1 table.

Key Result

Lemma 12

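Consider a discrete-time diffusion process with one-step transitions

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_t;\ \tilde{\alpha}_t \mathbf{x}_{t-1},\ \tilde{\sigma}_t^2 \mathbf{I}\right),$$

where $\mathbf{x}_t \in \mathbb{R}^d$ and $\{\tilde{\alpha}_t\}_{t \geq 1}$, $\{\tilde{\sigma}_t\}_{t \geq 1}$ are deterministic scalar sequences. Then for any $t \geq 1$:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\left(\mathbf{x}_t;\ \alpha_t \mathbf{x}_0,\ \sigma_t^2 \mathbf{I}\right), \qquad \sigma_t^2 \coloneqq \sum_{s=1}^{t} \tilde{\sigma}_s^2 \prod_{j=s+1}^{t} \tilde{\alpha}_j^2,$$

where $\alpha_t \coloneqq \prod_{j=1}^{t} \tilde{\alpha}_j$ and we use the convention that empty products equal $1$.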
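
The composition of one-step transitions can be checked numerically. The sketch below assumes Gaussian one-step transitions of the form $\mathbf{x}_t = \tilde{\alpha}_t \mathbf{x}_{t-1} + \tilde{\sigma}_t \boldsymbol{\epsilon}_t$ with i.i.d. standard normal $\boldsymbol{\epsilon}_t$. This functional form, the function names, and the example schedule are assumptions of the sketch. It compares a closed-form marginal mean scale and variance against the step-by-step recursion.

```python
import numpy as np


def marginal_closed_form(alpha_tilde, sigma_tilde):
    """Return (alpha_t, sigma_t^2) for t = 1..T from the closed-form expressions."""
    alpha_tilde = np.asarray(alpha_tilde, dtype=float)
    sigma_tilde = np.asarray(sigma_tilde, dtype=float)
    T = len(alpha_tilde)
    alpha = np.cumprod(alpha_tilde)  # alpha_t = prod_{j<=t} alpha_tilde_j
    var = np.empty(T)
    for t in range(T):  # array index t corresponds to step t + 1
        # sum_s sigma_tilde_s^2 * prod_{j=s+1}^{t} alpha_tilde_j^2;
        # np.prod of an empty slice is 1, matching the empty-product convention.
        var[t] = sum(
            sigma_tilde[s] ** 2 * np.prod(alpha_tilde[s + 1 : t + 1] ** 2)
            for s in range(t + 1)
        )
    return alpha, var


def marginal_recursive(alpha_tilde, sigma_tilde):
    """Same quantities obtained by composing the one-step transitions in order."""
    a, v = 1.0, 0.0  # x_0 is deterministic given itself: scale 1, variance 0
    alphas, variances = [], []
    for a_step, s_step in zip(alpha_tilde, sigma_tilde):
        a = a_step * a
        v = a_step**2 * v + s_step**2
        alphas.append(a)
        variances.append(v)
    return np.array(alphas), np.array(variances)


# Hypothetical variance-preserving schedule: alpha_tilde^2 + sigma_tilde^2 = 1.
betas = np.linspace(1e-4, 0.02, 50)
alpha_tilde, sigma_tilde = np.sqrt(1.0 - betas), np.sqrt(betas)

alpha_cf, var_cf = marginal_closed_form(alpha_tilde, sigma_tilde)
alpha_rec, var_rec = marginal_recursive(alpha_tilde, sigma_tilde)
```

For the variance-preserving schedule chosen here, the two routes agree and additionally satisfy $\alpha_t^2 + \sigma_t^2 = 1$ at every step, which is why such schedules allow sampling $\mathbf{x}_t$ directly from $\mathbf{x}_0$ without simulating intermediate steps.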

Figures (5)

  • Figure 1: Young researcher reading "Foundations of Diffusion Models in General State Spaces: A Self-Contained Introduction". The left side illustrates key ideas from continuous-state diffusion models, and the right side highlights corresponding principles for discrete-state models. Image generated with gemini-3-pro-image-preview (Gemini Team, 2023).
  • Figure 2: Visual roadmap of the manuscript. Three suggested reading paths are indicated: the introductory path (green) for newcomers to diffusion models, the advanced path (brown) for practitioners familiar with discrete-time diffusion seeking continuous-time theory, and the expert path (purple) for readers seeking the most general theoretical framework.
  • Figure 3: Unified perspective on diffusion models in continuous and discrete state spaces. (Top) An image $\mathbf{x}_0 \sim q_{\text{data}}$ is corrupted via a forward SDE to Gaussian noise $\mathbf{x}_1 \sim p_{\text{noise}}$; a reverse SDE reconstructs the clean data. (Bottom) A discrete sequence is corrupted via a forward CTMC through masking; a reverse CTMC recovers it. (Middle) Both processes are governed by infinitesimal generators $\mathscr{L}_t$ and Kolmogorov forward equations, providing a unified theoretical framework.
  • Figure 4: Latent diffusion example. Discrete text tokens are mapped to a continuous latent space via an encoder $p^{\boldsymbol{\phi}}$. In this latent space, one can perform continuous diffusion, following an SDE, to converge toward the noise distribution. The clean latent representation is then reconstructed using the reverse SDE. Finally, the decoder $p^{\boldsymbol{\psi}}$ maps the clean latent back to discrete text tokens.
  • Figure 5: For $d = 2$, $O(\Delta t)$ moves are horizontal or vertical (single-coordinate), while diagonal two-coordinate moves are $O(\Delta t^2)$ and vanish in the rate matrix.
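
The scaling described in the Figure 5 caption can be illustrated with a small numerical sketch. It assumes two independent coordinates that share a hypothetical 3-state rate matrix `Q`, and it uses a first-order (Euler) approximation of the one-step kernel. The matrix entries and helper names are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical per-coordinate rate matrix (off-diagonals >= 0, rows sum to zero).
Q = np.array([
    [-1.0, 0.6, 0.4],
    [0.5, -1.5, 1.0],
    [0.2, 0.8, -1.0],
])
S = Q.shape[0]
I = np.eye(S)

# Joint generator of two independent coordinates: the Kronecker sum.
Q_joint = np.kron(Q, I) + np.kron(I, Q)


def joint_index(a: int, b: int) -> int:
    """Flatten the state (a, b) to the row/column index used by np.kron."""
    return a * S + b


def per_unit_time_probs(dt: float):
    """Transition probability divided by dt for two moves out of state (0, 0)."""
    P1 = I + dt * Q  # first-order one-coordinate kernel
    P = np.kron(P1, P1)  # independent coordinates: the joint kernel factorises
    src = joint_index(0, 0)
    single = P[src, joint_index(1, 0)] / dt  # only the first coordinate moves
    diagonal = P[src, joint_index(1, 1)] / dt  # both coordinates move
    return single, diagonal


single_rate, diagonal_rate = per_unit_time_probs(1e-4)
```

Under these assumptions, `single_rate` stays close to the rate `Q[0, 1]` as `dt` shrinks, while `diagonal_rate` shrinks in proportion to `dt`. Correspondingly, the entry of `Q_joint` for the two-coordinate move is exactly zero.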

Theorems & Definitions (78)

  • Remark 1: Noise schedules
  • Proof: Derivation sketch
  • Remark 2: Discrete schedules & factorised $d$-dimensional process
  • Proof: Derivation sketch
  • Remark 3: Off-diagonal positivity, divergence-free & partial derivative view
  • Proof: Derivation sketch
  • Proof: Derivation sketch
  • Remark 4: Dimensional factorisation
  • Remark 5: Hat convention for time reversal
  • Proof: Derivation sketch
  • ...and 68 more