Demystifying Variational Diffusion Models

Fabio De Sousa Ribeiro; Ben Glocker

TL;DR

Demystifying Variational Diffusion Models presents a cohesive treatment of diffusion models, built on directed graphical models and variational inference, that places them within the top-down hierarchical latent variable model (HLVM) framework. It shows how a forward Gaussian diffusion process, a fixed top-down posterior, and a denoising network shared across the generative hierarchy yield a tractable evidence lower bound (ELBO) that diffusion losses approximate as a weighted integral over noise levels. In the continuous-time limit, a diffusion model can be viewed as an infinitely deep HLVM. The work clarifies several parameterizations of the diffusion objective (image denoising, noise, score, energy, velocity, flow) and establishes their equivalence through linear relationships, while detailing invariance to the forward noise schedule and practical estimation techniques. It also discusses practical choices (weighting, importance sampling, data augmentation) and offers guidance for future work on representation learning, broader forward processes, and causal interpretations, highlighting how diffusion models balance maximum-likelihood objectives against perceptual quality.
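
The linear relationships between parameterizations can be illustrated with a small sketch. It assumes a variance-preserving forward process $\mathbf{z}_t = \alpha_t \mathbf{x} + \sigma_t \boldsymbol{\epsilon}$ with $\alpha_t^2 + \sigma_t^2 = 1$, and it assumes the common velocity convention $\mathbf{v} = \alpha_t \boldsymbol{\epsilon} - \sigma_t \mathbf{x}$; the function names are hypothetical and scalars stand in for images. It is meant as an illustration of the idea, not as the paper's implementation.

```python
import math

def vp_coefficients(log_snr):
    """Variance-preserving coefficients for a given log signal-to-noise ratio.

    Assumes alpha^2 = sigmoid(log_snr) and sigma^2 = sigmoid(-log_snr),
    so that alpha^2 + sigma^2 = 1.
    """
    alpha = math.sqrt(1.0 / (1.0 + math.exp(-log_snr)))
    sigma = math.sqrt(1.0 / (1.0 + math.exp(log_snr)))
    return alpha, sigma

def diffuse(x, eps, alpha, sigma):
    """Forward process sample: z = alpha * x + sigma * eps."""
    return alpha * x + sigma * eps

def x_from_eps(z, eps, alpha, sigma):
    """Recover the clean data from a noise prediction."""
    return (z - sigma * eps) / alpha

def eps_from_x(z, x, alpha, sigma):
    """Recover the noise from a clean-data prediction."""
    return (z - alpha * x) / sigma

def v_from_x_eps(x, eps, alpha, sigma):
    """Velocity target under the assumed convention v = alpha * eps - sigma * x."""
    return alpha * eps - sigma * x

def x_from_v(z, v, alpha, sigma):
    """Recover the clean data from a velocity prediction (needs alpha^2 + sigma^2 = 1)."""
    return alpha * z - sigma * v

def eps_from_v(z, v, alpha, sigma):
    """Recover the noise from a velocity prediction (needs alpha^2 + sigma^2 = 1)."""
    return sigma * z + alpha * v

def conditional_score_from_eps(eps, sigma):
    """Score of q(z | x) = N(alpha * x, sigma^2) evaluated at z, written via eps."""
    return -eps / sigma
```

Because each conversion is linear in the prediction for a fixed $\mathbf{z}_t$ and noise level, a squared-error loss on one target differs from a squared-error loss on another only by a noise-level-dependent factor.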

Abstract

Despite the growing interest in diffusion models, gaining a deep understanding of the model class remains an elusive endeavour, particularly for the uninitiated in non-equilibrium statistical physics. Thanks to the rapid rate of progress in the field, most existing work on diffusion models focuses on either applications or theoretical contributions. Unfortunately, the theoretical material is often inaccessible to practitioners and new researchers, leading to a risk of superficial understanding in ongoing research. Given that diffusion models are now an indispensable tool, a clear and consolidating perspective on the model class is needed to properly contextualize recent advances in generative modelling and lower the barrier to entry for new researchers. To that end, we revisit predecessors to diffusion models like hierarchical latent variable models and synthesize a holistic perspective using only directed graphical modelling and variational inference principles. The resulting narrative is easier to follow as it imposes fewer prerequisites on the average reader relative to the view from non-equilibrium thermodynamics or stochastic differential equations.

Paper Structure (62 sections, 142 equations, 8 figures, 4 tables)

Introduction
Latent Variable Models
Variational Autoencoder
Hierarchical Latent Variable Models
Generative Feedback
Ladder Networks
Top-down Inference
Variational Lower Bound
The W[H]ole Problem
Variational Diffusion Models
On Representation Learning
Forward Process: Gaussian Diffusion
Variance Preserving Process
Linear Gaussian Transitions
The Top-down Posterior
...and 47 more sections

Figures (8)

Figure 1: Probabilistic graphical model of a latent variable model (e.g. variational autoencoder). Directed arrows represent the assumed flow of conditional dependencies or causal influence between variables.
Figure 2: Hierarchical latent variable graphical models. (a) The generative model $p(\mathbf{x}, \mathbf{z}_{1:T})$ of a hierarchical VAE with $T$ latent variables is a Markov chain. (b) The standard bottom-up inference model $q(\mathbf{z}_{1:T} \mid \mathbf{x})$ of a hierarchical VAE is a Markov chain in the reverse direction. (c) The top-down inference model follows the same topological ordering of the latent variables as the generative model. Notably, this top-down structure is also used to specify diffusion models. However, in diffusion models the posterior $q(\mathbf{z}_{1:T} \mid \mathbf{x})$ is tractable due to Gaussian conjugacy, which enables us to specify the generative model transitions as $p(\mathbf{z}_{t-1} \mid \mathbf{z}_{t}) = q(\mathbf{z}_{t-1} \mid \mathbf{z}_{t}, \mathbf{x} = \hat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{z}_t; t))$, where the data $\mathbf{x}$ is replaced by an image denoising model $\hat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{z}_t; t)$.
Figure 3: A Ladder Network. The latent variables $\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_T$ are noisy representations of $\mathbf{x}$, and $\mathbf{d}_1, \mathbf{d}_2, \dots, \mathbf{d}_T$ are clean representations; both sets are produced by a shared encoder (blue arrows). The variables $\hat{\mathbf{z}}_1, \hat{\mathbf{z}}_2, \dots, \hat{\mathbf{z}}_T$ are outputs of denoising functions where $\hat{\mathbf{z}}_t = g_t(\mathbf{z}_t, \hat{\mathbf{z}}_{t+1})$. Notice how $g_t(\cdot)$ receives both bottom-up and top-down information. The dashed horizontal lines denote local cost functions used to minimize $\|\hat{\mathbf{z}}_t - \mathbf{d}_t \|^2_2$. The main difference compared to denoising diffusion models is that here the denoising targets $\mathbf{d}_t$ are learned representations of $\mathbf{x}$ rather than fixed, increasingly noisy versions of $\mathbf{x}$.
Figure 4: Demonstration of the 'hole problem'. Results are from a single stochastic layer VAE trained on a 2D toy dataset with five clusters. The latent variable $\mathbf{z}$ is also 2-dimensional for illustration purposes. The leftmost column shows the dataset, overlaid with reconstructed datapoints (red border) and random samples from the generative model (blue border). The remaining columns show the assumed prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{z};0, \mathbf{I})$ (blue contours) overlaid with the aggregate posterior $q(\mathbf{z}) = \sum_{i=1}^N q(\mathbf{z} \mid \mathbf{x}_i) / N$. As shown, there are regions with high density under the prior which are assigned low density under the aggregate posterior. This affects the quality of the random samples since we are likely to sample from regions in $p(\mathbf{z})$ not covered by the data. Further, the bottom row shows a common occurrence in VAEs where one or more latent variables are not used at all by the model; in this case, $\mathbf{z}_2$ was not used.
Figure 5: Probabilistic graphical models of HVAEs and diffusion models. (a) The general top-down hierarchical latent variable model. (b) The top-down model used to specify diffusion models, where $q(\mathbf{z}_T \mid \mathbf{x}) = q(\mathbf{z}_T)$ by construction. Here the posterior $q(\mathbf{z}_{1:T} \mid \mathbf{x})$ is a fixed noising process, so the modelling task is bottom-up prediction of $\mathbf{x}$ from each $\mathbf{z}_t$, i.e. denoising (dashed lines). (c) The top-down model used for posterior inference in HVAEs. It consists of a deterministic bottom-up pass to compute $\mathbf{d}_1,\dots,\mathbf{d}_T$, followed by a stochastic top-down pass to compute $\mathbf{z}_T,\dots,\mathbf{z}_1$. (d) The reverse process of a diffusion model, i.e. the generative model. The main differences compared to (c) are that here the deterministic variables $\mathbf{d}_{T-1},\dots,\mathbf{d}_1$ neither depend on $\mathbf{x}$ nor have their own hierarchical dependencies. Further, the blue lines represent a denoising model $\hat{\mathbf{x}}_{\boldsymbol{\theta}} :\mathbf{z}_t \to \mathbf{d}_t$ which is shared across the hierarchy.
...and 3 more figures
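
The generative transition described in the Figure 2 caption, $p(\mathbf{z}_{t-1} \mid \mathbf{z}_{t}) = q(\mathbf{z}_{t-1} \mid \mathbf{z}_{t}, \mathbf{x} = \hat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{z}_t; t))$, can be sketched as follows. The sketch assumes a variance-preserving Gaussian forward process $q(\mathbf{z}_t \mid \mathbf{x}) = \mathcal{N}(\alpha_t \mathbf{x}, \sigma_t^2 \mathbf{I})$ parameterized by a log signal-to-noise ratio, uses the standard Gaussian conjugacy result for the posterior between two noise levels $s < t$, and works on scalars. The function names and the toy denoiser are hypothetical, and conditioning the denoiser on the log-SNR rather than on $t$ is an assumption of this example.

```python
import math
import random

def vp_coefficients(log_snr):
    """Assumed variance-preserving coefficients: alpha^2 + sigma^2 = 1."""
    alpha = math.sqrt(1.0 / (1.0 + math.exp(-log_snr)))
    sigma = math.sqrt(1.0 / (1.0 + math.exp(log_snr)))
    return alpha, sigma

def top_down_posterior(z_t, x, log_snr_s, log_snr_t):
    """Mean and variance of q(z_s | z_t, x) for s < t (log_snr_s > log_snr_t)."""
    alpha_s, sigma_s = vp_coefficients(log_snr_s)
    alpha_t, sigma_t = vp_coefficients(log_snr_t)
    # Parameters of the forward transition q(z_t | z_s) = N(alpha_ts * z_s, var_ts).
    alpha_ts = alpha_t / alpha_s
    var_ts = sigma_t ** 2 - alpha_ts ** 2 * sigma_s ** 2
    # Gaussian conjugacy gives the posterior in closed form.
    mean = (alpha_ts * sigma_s ** 2 / sigma_t ** 2) * z_t \
        + (alpha_s * var_ts / sigma_t ** 2) * x
    var = var_ts * sigma_s ** 2 / sigma_t ** 2
    return mean, var

def generative_step(z_t, denoiser, log_snr_s, log_snr_t, rng):
    """Sample z_s ~ p(z_s | z_t) by plugging the denoiser output in place of x."""
    x_hat = denoiser(z_t, log_snr_t)
    mean, var = top_down_posterior(z_t, x_hat, log_snr_s, log_snr_t)
    return mean + math.sqrt(var) * rng.gauss(0.0, 1.0)

def toy_denoiser(z_t, log_snr_t):
    """Hypothetical stand-in for a trained network.

    For data x ~ N(0, 1) under the assumed process, E[x | z_t] = alpha_t * z_t.
    """
    alpha_t, _ = vp_coefficients(log_snr_t)
    return alpha_t * z_t

# Usage: one ancestral sampling step from a noisier level to a cleaner one.
rng = random.Random(0)
z_t = rng.gauss(0.0, 1.0)
z_s = generative_step(z_t, toy_denoiser, log_snr_s=-2.0, log_snr_t=-4.0, rng=rng)
```

The same denoiser is reused at every step, which mirrors the shared denoising model described in the Figure 5 caption; only the noise-level conditioning changes between steps.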
