Likelihood Training of Cascaded Diffusion Models via Hierarchical Volume-preserving Maps

Henry Li; Ronen Basri; Yuval Kluger

TL;DR

This work addresses the intractable likelihood of cascaded diffusion models by introducing hierarchical volume-preserving maps (HVPM), such as Laplacian pyramids and wavelet transforms, under which the data likelihood is invariant: $p_\theta(\mathbf{x}) = p_\theta(h(\mathbf{x}))$. This invariance yields an exact, scale-wise likelihood decomposition $\log p_\theta(\mathbf{x}) = \log p_\theta(\mathbf{z}^{(1)}) + \sum_{s=2}^S \log p_\theta(\mathbf{z}^{(s)}|\mathbf{z}^{(<s)})$ and a practical training objective $\mathcal{C}(\mathbf{x}) = \mathcal{L}(\mathbf{z}^{(1)}) + \sum_{s=2}^S \mathcal{L}(\mathbf{z}^{(s)}|\mathbf{z}^{(<s)})$. The authors further connect likelihood training to the Earth Mover's Distance (EMD) via an optimal transport (OT) based bound, which enables linear-time estimation of a perceptual transport cost. Empirically, LP-PCDM and W-PCDM achieve state-of-the-art likelihoods on image benchmarks, improved lossless compression, and improved out-of-distribution (OOD) detection. The work thus supports likelihood-based training and evaluation of multi-scale diffusion models and highlights theoretical links to OT and perceptual metrics.
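The step from invariance to the scale-wise decomposition can be written out as a short derivation. This is a sketch under stated assumptions rather than the paper's own proof: it assumes that $h$ is invertible and differentiable, and that "volume-preserving" means $|\det J_h(\mathbf{x})| = 1$ everywhere, where $J_h$ (a symbol introduced here for illustration) is the Jacobian of $h$.

```latex
\begin{align*}
p_\theta(\mathbf{x})
  &= p_\theta\big(h(\mathbf{x})\big)\,\big|\det J_h(\mathbf{x})\big|
  && \text{(change of variables)} \\
  &= p_\theta\big(\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(S)}\big)
  && \text{(assumed } |\det J_h(\mathbf{x})| = 1\text{)} \\
\log p_\theta(\mathbf{x})
  &= \log p_\theta\big(\mathbf{z}^{(1)}\big)
   + \sum_{s=2}^{S} \log p_\theta\big(\mathbf{z}^{(s)} \,\big|\, \mathbf{z}^{(<s)}\big)
  && \text{(chain rule of probability)}
\end{align*}
```

Under these assumptions no Jacobian correction term appears, so a per-scale loss for each conditional can be summed directly, which matches the form of the objective $\mathcal{C}(\mathbf{x})$ above.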

Abstract

Cascaded models are multi-scale generative models with a marked capacity for producing perceptually impressive samples at high resolutions. In this work, we show that they can also be excellent likelihood models, so long as we overcome a fundamental difficulty with probabilistic multi-scale models: the intractability of the likelihood function. Chiefly, in cascaded models each intermediary scale introduces extraneous variables that cannot be tractably marginalized out for likelihood evaluation. This issue vanishes by modeling the diffusion process on latent spaces induced by a class of transformations we call hierarchical volume-preserving maps, which decompose spatially structured data in a hierarchical fashion without introducing local distortions in the latent space. We demonstrate that two such maps are well-known in the literature for multiscale modeling: Laplacian pyramids and wavelet transforms. Not only do such reparameterizations allow the likelihood function to be directly expressed as a joint likelihood over the scales, we show that the Laplacian pyramid and wavelet transform also produce significant improvements to the state-of-the-art on a selection of benchmarks in likelihood modeling, including density estimation, lossless compression, and out-of-distribution detection. Investigating the theoretical basis of our empirical gains, we uncover deep connections to score matching under the Earth Mover's Distance (EMD), which is a well-known surrogate for perceptual similarity. Code can be found at https://github.com/lihenryhfl/pcdm.

Paper Structure (25 sections, 6 theorems, 50 equations, 3 figures, 6 tables)

Introduction
Related Work
Likelihood Training
Multiscale Generative Models
Diffusion Modeling with Earth Mover's Distances
Background
Hierarchical Volume-Preserving Maps
A Probabilistic Invariance
Standard Cascaded Hierarchy
Special Instances of Hierarchical Volume-preserving Maps
Laplacian Pyramids
Wavelet Decomposition
Likelihood Training of Cascaded Diffusion Models
Coupled Diffusion Modeling on a $h$-induced Latent Space
A Connection to Optimal Transport
...and 10 more sections

Key Result

Lemma 4.1

Let $h$ be a hierarchical volume-preserving map such that $h(\mathbf{x}) = (\mathbf{z}^{(1)}, \mathbf{z}^{(2)}, \dots, \mathbf{z}^{(S)})$, and $p_\theta$ be a likelihood function on $\mathbf{z}^{(1)}, \mathbf{z}^{(2)}, \dots, \mathbf{z}^{(S)}$. Then the likelihood function with respect to the origin

Figures (3)

Figure 1: Images generated from our W-PCDM model trained on unconditional ImageNet 128x128.
Figure 2: A Laplacian pyramid hierarchy with $S=4$. Left to right: $z^{(1)}, \dots, z^{(4)}$.
Figure 3: A wavelet hierarchy with $S=4$. Left to right: $z^{(1)}, \dots, z^{(4)}$.
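
A wavelet hierarchy like the one in Figure 3 can be illustrated with a small NumPy sketch. It uses an orthonormal 2D Haar transform as a stand-in and makes no claim to match the wavelet family, band ordering, or normalization used in the paper. The function names (`haar_step`, `haar_hierarchy`, `flatten`, `jacobian`) are hypothetical and introduced only for this example.

```python
import numpy as np


def haar_step(x):
    """One level of an orthonormal 2D Haar transform on an (H, W) array (H, W even)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    # Each 2x2 block is mapped by a 4x4 Hadamard matrix scaled by 1/2 (orthogonal).
    low = (a + b + c + d) / 2.0
    details = np.stack([
        (a - b + c - d) / 2.0,
        (a + b - c - d) / 2.0,
        (a - b - c + d) / 2.0,
    ])
    return low, details


def haar_hierarchy(x, num_scales):
    """Return [z1, ..., zS]: z1 is the coarsest approximation, z2..zS are
    detail bands ordered from coarse to fine (an assumed ordering)."""
    low, details = x, []
    for _ in range(num_scales - 1):
        low, d = haar_step(low)
        details.append(d)
    return [low] + details[::-1]


def flatten(zs):
    """Concatenate all scales into one vector with as many entries as the input."""
    return np.concatenate([z.ravel() for z in zs])


def jacobian(shape, num_scales):
    """Jacobian of the (linear) map x -> flatten(haar_hierarchy(x)),
    built column by column from basis images."""
    n = shape[0] * shape[1]
    J = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = 1.0
        J[:, i] = flatten(haar_hierarchy(e.reshape(shape), num_scales))
    return J


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
zs = haar_hierarchy(x, num_scales=4)
sign, logabsdet = np.linalg.slogdet(jacobian((8, 8), 4))
print([z.shape for z in zs])
print("log|det J| is numerically zero:", bool(np.isclose(logabsdet, 0.0, atol=1e-8)))
```

Because this map is linear and orthogonal, its Jacobian has unit absolute determinant, so a density defined on the coefficients $z^{(1)}, \dots, z^{(4)}$ transfers to the input image without a correction term. The dense Jacobian is built here only as a check on a tiny image and would not be formed in practice.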

Theorems & Definitions (11)

Definition 1: Volume-preserving Maps (berger2012differential)
Lemma 4.1: Probabilistic Invariance of Hierarchical Volume-preserving Maps
Definition 2: Wasserstein $p$-Metric
Theorem 5.1: Cascaded Diffusion Modeling and EMD Score Matching
Lemma A.1: Probabilistic Invariance of Hierarchical Volume-preserving Maps
proof
Theorem A.1: Cascaded Diffusion Modeling and EMD Score Matching
proof
Lemma A.1: From Theorem 2 in shirdhonkar2008approximate
Lemma A.2
...and 1 more
