Likelihood Training of Cascaded Diffusion Models via Hierarchical Volume-preserving Maps

Henry Li, Ronen Basri, Yuval Kluger

TL;DR

This work addresses the intractable likelihood problem in cascaded diffusion models by introducing hierarchical volume-preserving maps (HVPM), such as Laplacian pyramids and wavelet transforms, under which the data likelihood is invariant: $p_\theta(\mathbf{x}) = p_\theta(h(\mathbf{x}))$. This enables an exact, scale-wise likelihood decomposition, $\log p_\theta(\mathbf{x}) = \log p_\theta(\mathbf{z}^{(1)}) + \sum_{s=2}^S \log p_\theta(\mathbf{z}^{(s)}|\mathbf{z}^{(<s)})$, and a practical training objective, $\mathcal{C}(\mathbf{x}) = \mathcal{L}(\mathbf{z}^{(1)}) + \sum_{s=2}^S \mathcal{L}(\mathbf{z}^{(s)}|\mathbf{z}^{(<s)})$. The authors further connect likelihood training to the Earth Mover's Distance (EMD) via a bound based on optimal transport (OT), enabling linear-time estimation of a perceptual transport cost. Empirically, LP-PCDM and W-PCDM achieve state-of-the-art likelihoods on image benchmarks, improved lossless compression, and enhanced out-of-distribution (OOD) detection, demonstrating the practical value of multi-scale likelihood modeling with HVPM. The work thus supports likelihood-based training and evaluation in multi-scale diffusion frameworks and highlights theoretical links to OT and perceptual metrics.
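
The step from volume preservation to the scale-wise decomposition can be sketched with the standard change-of-variables formula. The derivation below is an illustrative reading, not the paper's proof: it assumes $h$ is a diffeomorphism whose Jacobian $J_h$ (notation introduced here) satisfies $|\det J_h(\mathbf{x})| = 1$ everywhere, and that $h(\mathbf{x}) = (\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(S)})$.

```latex
\begin{align}
p_\theta(\mathbf{x})
  &= p_\theta\big(h(\mathbf{x})\big)\,\big|\det J_h(\mathbf{x})\big|
     && \text{(change of variables)} \\
  &= p_\theta\big(h(\mathbf{x})\big)
     && \text{(assumed } |\det J_h(\mathbf{x})| = 1\text{)} \\
\log p_\theta(\mathbf{x})
  &= \log p_\theta\big(\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(S)}\big) \\
  &= \log p_\theta\big(\mathbf{z}^{(1)}\big)
     + \sum_{s=2}^{S} \log p_\theta\big(\mathbf{z}^{(s)} \mid \mathbf{z}^{(<s)}\big)
     && \text{(chain rule of probability)}
\end{align}
```

Because no Jacobian term survives, each scale contributes only its own conditional log-likelihood, which is what allows a per-scale loss $\mathcal{L}$ to be summed into $\mathcal{C}(\mathbf{x})$.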

Abstract

Cascaded models are multi-scale generative models with a marked capacity for producing perceptually impressive samples at high resolutions. In this work, we show that they can also be excellent likelihood models, so long as we overcome a fundamental difficulty with probabilistic multi-scale models: the intractability of the likelihood function. Chiefly, in cascaded models each intermediary scale introduces extraneous variables that cannot be tractably marginalized out for likelihood evaluation. This issue vanishes by modeling the diffusion process on latent spaces induced by a class of transformations we call hierarchical volume-preserving maps, which decompose spatially structured data in a hierarchical fashion without introducing local distortions in the latent space. We demonstrate that two such maps are well-known in the literature for multiscale modeling: Laplacian pyramids and wavelet transforms. Not only do such reparameterizations allow the likelihood function to be directly expressed as a joint likelihood over the scales, but we also show that the Laplacian pyramid and wavelet transform produce significant improvements to the state-of-the-art on a selection of benchmarks in likelihood modeling, including density estimation, lossless compression, and out-of-distribution detection. Investigating the theoretical basis of our empirical gains, we uncover deep connections to score matching under the Earth Mover's Distance (EMD), which is a well-known surrogate for perceptual similarity. Code can be found at https://github.com/lihenryhfl/pcdm.
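
To make "decompose data hierarchically without local distortion" concrete, here is a minimal sketch using a one-dimensional orthonormal Haar wavelet transform. It is an illustration only: the paper works with images, and the function names (`haar_step`, `haar_hierarchy`, `haar_inverse`) are hypothetical and not taken from the authors' repository. An orthonormal transform preserves squared norm and is exactly invertible, which is the sense in which it leaves volumes unchanged.

```python
import math

def haar_step(signal):
    """One orthonormal Haar step: split an even-length signal into (coarse, detail)."""
    r = math.sqrt(2.0)
    coarse = [(signal[i] + signal[i + 1]) / r for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / r for i in range(0, len(signal), 2)]
    return coarse, detail

def haar_hierarchy(signal, num_scales):
    """Return [z1, ..., zS]: z1 is the coarsest band, zS the finest detail band."""
    details = []
    coarse = list(signal)
    for _ in range(num_scales - 1):
        coarse, detail = haar_step(coarse)
        details.append(detail)  # finest detail band is produced first
    return [coarse] + details[::-1]

def haar_inverse(scales):
    """Rebuild the signal by merging the coarse band with each detail band in turn."""
    r = math.sqrt(2.0)
    coarse = list(scales[0])
    for detail in scales[1:]:
        merged = []
        for c, d in zip(coarse, detail):
            merged.append((c + d) / r)
            merged.append((c - d) / r)
        coarse = merged
    return coarse

def squared_norm(values):
    return sum(v * v for v in values)

x = [4.0, 2.0, 5.0, 7.0, 1.0, 3.0, 6.0, 8.0]
z = haar_hierarchy(x, num_scales=4)
flat_z = [v for band in z for v in band]
x_rec = haar_inverse(z)
band_lengths = [len(band) for band in z]
```

With an 8-sample input and four scales, the bands have lengths 1, 1, 2, and 4, so the latent representation has exactly as many coordinates as the input and nothing needs to be marginalized out.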

Paper Structure

This paper contains 25 sections, 6 theorems, 50 equations, 3 figures, and 6 tables.

Key Result

Lemma 4.1

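Let $h$ be a hierarchical volume-preserving map such that $h(\mathbf{x}) = (\mathbf{z}^{(1)}, \mathbf{z}^{(2)}, \dots, \mathbf{z}^{(S)})$, and let $p_\theta$ be a likelihood function on $\mathbf{z}^{(1)}, \mathbf{z}^{(2)}, \dots, \mathbf{z}^{(S)}$. Then the likelihood function with respect to the original data $\mathbf{x}$ satisfies $p_\theta(\mathbf{x}) = p_\theta(h(\mathbf{x})) = p_\theta(\mathbf{z}^{(1)}, \mathbf{z}^{(2)}, \dots, \mathbf{z}^{(S)})$.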
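
The invariance in this lemma can be checked numerically on a toy case. The sketch below is an assumption-laden illustration, not the paper's model: it uses a two-dimensional orthonormal map as a stand-in for $h$, and independent zero-mean Gaussians for the coarse and detail coordinates, so the conditional $p(z^{(2)} \mid z^{(1)})$ reduces to a marginal. All function names are hypothetical. The scale-wise sum of log densities in latent space should equal the log density of $\mathbf{x}$ computed directly from its induced covariance.

```python
import math

def gaussian_logpdf(value, std):
    """Log density of a zero-mean 1D Gaussian with standard deviation std."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * (value / std) ** 2

def h(x1, x2):
    """Orthonormal 2D map (average and difference channels), so |det J| = 1."""
    r = math.sqrt(2.0)
    return (x1 + x2) / r, (x1 - x2) / r

def log_p_latent(x1, x2, std_coarse, std_detail):
    """Scale-wise sum log p(z1) + log p(z2 | z1); z2 is independent of z1 in this toy."""
    z1, z2 = h(x1, x2)
    return gaussian_logpdf(z1, std_coarse) + gaussian_logpdf(z2, std_detail)

def log_p_data(x1, x2, std_coarse, std_detail):
    """Direct log density of x under N(0, Sigma), with Sigma = H^T diag(var) H."""
    a, b = std_coarse ** 2, std_detail ** 2
    s11 = s22 = 0.5 * (a + b)
    s12 = 0.5 * (a - b)
    det = s11 * s22 - s12 * s12
    quad = (s22 * x1 * x1 - 2.0 * s12 * x1 * x2 + s11 * x2 * x2) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

latent_value = log_p_latent(0.3, -1.2, std_coarse=2.0, std_detail=0.5)
data_value = log_p_data(0.3, -1.2, std_coarse=2.0, std_detail=0.5)
gap = abs(latent_value - data_value)
```

The two values agree up to floating-point error because the determinant of the induced covariance equals the product of the per-scale variances, so no correction term separates the data-space and latent-space likelihoods.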

Figures (3)

  • Figure 1: Images generated from our W-PCDM model trained on unconditional ImageNet 128x128.
  • Figure 2: A Laplacian pyramid hierarchy with $S=4$. Left to right: $z^{(1)}, \dots, z^{(4)}$.
  • Figure 3: A wavelet hierarchy with $S=4$. Left to right: $z^{(1)}, \dots, z^{(4)}$.

Theorems & Definitions (11)

  • Definition 1: Volume-preserving Maps [berger2012differential]
  • Lemma 4.1: Probabilistic Invariance of Hierarchical Volume-preserving Maps
  • Definition 2: Wasserstein $p$-Metric
  • Theorem 5.1: Cascaded Diffusion Modeling and EMD Score Matching
  • Lemma A.1: Probabilistic Invariance of Hierarchical Volume-preserving Maps
  • proof
  • Theorem A.1: Cascaded Diffusion Modeling and EMD Score Matching
  • proof
  • Lemma A.1: From Theorem 2 in [shirdhonkar2008approximate]
  • Lemma A.2
  • ...and 1 more
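
For readers unfamiliar with the Wasserstein metric named in Definition 2, the Earth Mover's Distance is its $p = 1$ case: the minimum total mass-times-distance needed to reshape one distribution into another. The sketch below covers only the simplest setting, two histograms on the same unit-spaced one-dimensional bins, where the distance has a well-known closed form as the summed absolute difference of cumulative sums. It is not the bound from Theorem 5.1 or the approximation from the cited lemma; the function name `emd_1d` is hypothetical.

```python
def emd_1d(hist_a, hist_b):
    """Wasserstein-1 distance between two histograms on shared unit-spaced 1D bins."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(hist_a) - sum(hist_b)) > 1e-9:
        raise ValueError("histograms must carry the same total mass")
    carried = 0.0  # surplus mass that must be moved past the current bin boundary
    cost = 0.0
    for a, b in zip(hist_a, hist_b):
        carried += a - b
        cost += abs(carried)
    return cost

far = emd_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])   # one unit of mass moved two bins
near = emd_1d([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])  # half units each moved one bin
same = emd_1d([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])
```

The single pass over the bins makes this one-dimensional case linear in the number of bins. General optimal transport between images is far more expensive, which is why a linear-time estimate of the kind the summary describes is of practical interest.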