Hierarchical VAE with a Diffusion-based VampPrior

Anna Kuzina, Jakub M. Tomczak

TL;DR

This work targets training instability and inefficiency in deep hierarchical VAEs by introducing a VampPrior with amortized pseudoinputs and a diffusion-based prior over those pseudoinputs. The model extends the Ladder VAE framework with a non-trainable transformation, the discrete cosine transform (DCT), to generate compact pseudoinputs, and it uses latent aggregation to keep all latent levels active, achieving strong performance with fewer parameters and layers. The key contributions are an amortized VampPrior for all hierarchical levels, DCT-based pseudoinputs, a diffusion-based prior over pseudoinputs, and a latent-aggregation mechanism that improves latent-space utilization, validated on MNIST, OMNIGLOT, and CIFAR-10. These advances yield improved negative log-likelihood and bits-per-dimension metrics, reduced memory demands, and stable training without the ad hoc stabilization tricks often required for deep hierarchical VAEs. The approach is relevant for deploying deep hierarchical VAEs in resource-constrained settings while maintaining sample quality and latent-disentanglement properties.
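To make the idea of a DCT-based pseudoinput concrete, the following is a minimal sketch, not the authors' implementation. It assumes a square grayscale image given as nested lists, an orthonormal DCT-II, and that the "compact" pseudoinput is the low-frequency top-left corner of the coefficient matrix. The function names (`dct_matrix`, `dct_pseudoinput`, `reconstruct_from_pseudoinput`) are hypothetical, and any normalization the paper applies is omitted.

```python
import math


def dct_matrix(n):
    """Return the n x n orthonormal DCT-II matrix as nested lists."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append(
            [scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)]
        )
    return rows


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dct_pseudoinput(image, size):
    """2D DCT of a square image, cropped to the size x size low-frequency corner.

    Assumption: cropping low frequencies is how the pseudoinput is made compact.
    """
    d = dct_matrix(len(image))
    coeffs = matmul(matmul(d, image), transpose(d))  # D X D^T
    return [row[:size] for row in coeffs[:size]]


def reconstruct_from_pseudoinput(u, n):
    """Zero-pad the pseudoinput to n x n and apply the inverse DCT (D^T C D)."""
    size = len(u)
    padded = [
        [u[r][c] if r < size and c < size else 0.0 for c in range(n)]
        for r in range(n)
    ]
    d = dct_matrix(n)
    return matmul(matmul(transpose(d), padded), d)
```

Because the transformation has no trainable parameters, the pseudoinput is a fixed function of the data point; the zero-padded inverse shows how much coarse image content a small block of coefficients retains.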

Abstract

Deep hierarchical variational autoencoders (VAEs) are powerful latent variable generative models. In this paper, we introduce Hierarchical VAE with Diffusion-based Variational Mixture of the Posterior Prior (VampPrior). We apply amortization to scale the VampPrior to models with many stochastic layers. The proposed approach allows us to achieve better performance compared to the original VampPrior work and other deep hierarchical VAEs, while using fewer parameters. We empirically validate our method on standard benchmark datasets (MNIST, OMNIGLOT, CIFAR10) and demonstrate improved training stability and latent space utilization.
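As a reading aid, the contrast between the original VampPrior and an amortized variant can be sketched as follows. The first expression is the standard VampPrior with $K$ learnable pseudoinputs $\mathbf{u}_k$. The second is one plausible reading of "amortization", in which pseudoinputs are produced from data through $r(\mathbf{u}|\mathbf{x})$ (the notation used in the figure captions) rather than stored as free parameters. The symbol $q(\mathbf{x})$ for the empirical data distribution is an assumption here, and the exact formulation in the paper may differ.

$$
p^{\mathrm{VP}}(\mathbf{z}) = \frac{1}{K}\sum_{k=1}^{K} q_{\phi}(\mathbf{z}\mid\mathbf{u}_k)
\quad\longrightarrow\quad
p(\mathbf{z}) = \mathbb{E}_{r(\mathbf{u})}\big[q_{\phi}(\mathbf{z}\mid\mathbf{u})\big],
\qquad
r(\mathbf{u}) = \mathbb{E}_{q(\mathbf{x})}\big[r(\mathbf{u}\mid\mathbf{x})\big].
$$

Under this reading, sampling from the prior requires a model of $r(\mathbf{u})$, which is the role the summary attributes to the diffusion-based prior over pseudoinputs.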

Paper Structure

This paper contains 38 sections, 25 equations, 10 figures, 9 tables, 2 algorithms.

Figures (10)

  • Figure 1: Graphical model of the TopDown hierarchical VAE with three latent variables (a) without pseudoinputs and (b) with pseudoinputs. The inference model (left) and the generative model (right) share parameters in the TopDown path (blue). The dashed arrow represents a non-trainable transformation.
  • Figure 2: A diagram of the DVP-VAE: TopDown hierarchical VAE with the diffusion-based VampPrior. (a) A BottomUp path (left) and a TopDown path (right). (b) A TopDown block that takes features from the block above $\mathbf{h}^{dec}$, encoder features $\mathbf{h}^{enc}$ (only during training) and a pseudoinput ${\mathbf{u}}$ as inputs. (c) A single Resnet block. (d) A single pseudoinput block.
  • Figure 3: Unconditional samples from the Diffusion-based VampPrior (top) and corresponding samples from the DVP-VAE (bottom).
  • Figure 4: Generative reconstructions. The top row uses only a pseudoinput sampled from $r({\mathbf{u}}|{\mathbf{x}})$.
  • Figure 5: Ablation study of the pseudoinput type (DCT and downsampled image), pseudoinput prior (diffusion model and mixture of Gaussians) and pseudoinput size (ranging from $3\times 3$ to $11\times 11$). Each configuration is trained with four different random seeds.
  • ...and 5 more figures