Learning Latent Space Hierarchical EBM Diffusion Models

Jiali Cui; Tian Han

TL;DR

The paper addresses the prior-hole problem in multi-layer latent-variable generators that rely on Gaussian priors by introducing a diffusion-based learning framework for an energy-based prior. It constructs a sequence of conditional EBMs operating in a uni-scale latent space $\tilde{\mathbf{u}}$. The forward diffusion preserves the hierarchical structure through the mapping $\tilde{\mathbf{z}} = T_{\beta_{>0}}(\tilde{\mathbf{u}})$, and the reverse process uses Langevin-based sampling guided by the energies $F_{\omega}(T_{\beta_{>0}}(\tilde{\mathbf{u}}_t), t)$. The approach enables tractable EBM learning, improved sample quality, and controllable, hierarchical generation through coupling with symbol vectors $\mathbf{y}$ and layer-wise energy terms. Experiments on CIFAR-10, CelebA-HQ-256, and LSUN-Church-64 demonstrate competitive synthesis quality, interpretable hierarchical representations, and effective controllable synthesis.
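To make the two ingredients named above more concrete, here is a minimal sketch of a forward noising step and a Langevin sampling loop. It is not the paper's implementation: the quadratic `toy_energy_grad` is a hypothetical stand-in for the gradient of the learned energy $F_{\omega}$, the variance-preserving form of `forward_diffuse` is an assumption (the summary does not spell out the noise schedule), and the mapping $T_{\beta_{>0}}$ is omitted entirely. All function names and parameters are illustrative.

```python
import numpy as np


def toy_energy_grad(u, t):
    """Gradient of a hypothetical energy E(u, t) = 0.5 * ||u||^2.

    Stand-in for the gradient of a learned energy; t is unused here.
    """
    return u


def forward_diffuse(u0, alpha_bar_t, rng):
    """Assumed variance-preserving noising: sqrt(a) * u0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(u0.shape)
    return np.sqrt(alpha_bar_t) * u0 + np.sqrt(1.0 - alpha_bar_t) * eps


def langevin_sample(u_init, t, grad_fn, step_size=0.1, n_steps=200, rng=None):
    """Unadjusted Langevin dynamics targeting p(u) proportional to exp(-E(u, t)).

    Update: u <- u - (step_size / 2) * grad E(u, t) + sqrt(step_size) * noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = np.array(u_init, dtype=float, copy=True)
    for _ in range(n_steps):
        noise = rng.standard_normal(u.shape)
        u = u - 0.5 * step_size * grad_fn(u, t) + np.sqrt(step_size) * noise
    return u


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u0 = rng.standard_normal((1000, 2))          # pretend "clean" latent codes
    u_t = forward_diffuse(u0, alpha_bar_t=0.5, rng=rng)
    u_s = langevin_sample(u_t, t=0, grad_fn=toy_energy_grad, rng=rng)
    print(u_s.mean(), u_s.std())
```

With the quadratic toy energy, the chain settles close to a standard Gaussian, which makes the sketch easy to sanity-check. A multi-modal energy would need more steps or smaller step sizes, which is the sampling difficulty the paper sets out to mitigate.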

Abstract

This work studies the learning problem of the energy-based prior model and the multi-layer generator model. The multi-layer generator model, which contains multiple layers of latent variables organized in a top-down hierarchical structure, typically assumes a Gaussian prior. Such a prior can be limited in expressivity, which results in a gap between the generator posterior and the prior model, known as the prior hole problem. Recent works have explored learning an energy-based model (EBM) prior as a second-stage, complementary model to bridge the gap. However, an EBM defined on a multi-layer latent space can be highly multi-modal, which makes sampling from such a marginal EBM prior challenging in practice and results in an ineffectively learned EBM. To tackle this challenge, we propose to leverage the diffusion probabilistic scheme to mitigate the burden of EBM sampling and thus facilitate EBM learning. Our extensive experiments demonstrate superior performance of our diffusion-learned EBM prior on various challenging tasks.

Paper Structure

This paper contains 22 sections, 22 equations, 10 figures, 3 tables, 2 algorithms.

Introduction
Preliminary
Multi-layer Latent Variable Model
Energy-based Prior Model
Methodology
Diffusion with Multi-layer Latent Variables
Reverse with Multi-layer Latent Variables
Coupling with symbol vector
Related Work
Experiment
Image Synthesis
Hierarchical Representation
Controllable Synthesis
Langevin Transition for Energy Landscape
Ablation Studies
...and 7 more sections

Figures (10)

Figure 1: Image synthesis on CelebA-HQ-256 (left), LSUN-Church-64 (center) and CIFAR-10 (right).
Figure 2: Hierarchical sampling. Visualization of representations learned by latent variables from the top to bottom layers, arranged as top-left, top-right, bottom-left and bottom-right.
Figure 3: AUROC results for energy scores of different layers (denoted as $L>k$ for using top layers above $k$-th layer). Top figure visualizes the comparison between the diffusion scheme (${\mathbf{\tilde{u}}}_0$ sampled from EBM) and the inference scheme (${\mathbf{\tilde{u}}}_0$ inferred from inference model) in different layers. Bottom figure is the histogram of energy scores using all layers $L>0$ and top layers $L>27$. Total number of layers is 30.
Figure 4: Fine-tuned image synthesis with multiple attributes on CelebA-64.
Figure 5: Controllable synthesis on CIFAR-10.
...and 5 more figures
