Hierarchical VAE with a Diffusion-based VampPrior
Anna Kuzina, Jakub M. Tomczak
TL;DR
This work addresses training instability and inefficiency in deep hierarchical VAEs by introducing a diffusion-based VampPrior. The model extends the Ladder VAE framework in three ways: pseudoinputs are amortized and generated by a non-trainable transformation (the discrete cosine transform, DCT), which keeps them compact; a diffusion model serves as the prior over these pseudoinputs; and latent aggregation keeps all latent levels active. The key contributions are therefore an amortized VampPrior that applies to all hierarchical levels, DCT-based pseudoinputs, a diffusion-based prior over pseudoinputs, and a latent-aggregation mechanism that improves latent-space utilization. The method is validated on MNIST, OMNIGLOT, and CIFAR-10, where it reports improved negative log-likelihood and bits-per-dimension with fewer parameters and layers, reduced memory demands, and stable training without the ad hoc stabilization tricks that deep hierarchical VAEs often require. This may make deep hierarchical VAEs more practical in resource-constrained settings.
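To make the DCT-based pseudoinput idea concrete, here is a minimal sketch, assuming a pseudoinput is formed by keeping only the lowest-frequency block of an image's 2D DCT coefficients. The function names (`dct_matrix`, `dct_pseudoinput`, `reconstruct_from_pseudoinput`), the crop size `k`, and the absence of any normalization step are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix; its transpose is its inverse."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    c[0, :] /= np.sqrt(2.0)     # rescale the DC row so that c @ c.T == I
    return c


def dct_pseudoinput(x, k):
    """Hypothetical helper: keep the k x k lowest-frequency 2D DCT coefficients."""
    h, w = x.shape
    coeffs = dct_matrix(h) @ x @ dct_matrix(w).T
    return coeffs[:k, :k]


def reconstruct_from_pseudoinput(u, shape):
    """Zero-pad the kept coefficients back to `shape` and invert the DCT."""
    h, w = shape
    padded = np.zeros(shape, dtype=u.dtype)
    padded[: u.shape[0], : u.shape[1]] = u
    return dct_matrix(h).T @ padded @ dct_matrix(w)


# Stand-in for a 28 x 28 grayscale image (random data, illustration only).
rng = np.random.default_rng(0)
x = rng.random((28, 28))

u = dct_pseudoinput(x, k=7)                       # compact 7 x 7 representation
x_hat = reconstruct_from_pseudoinput(u, x.shape)  # low-pass approximation of x
print(u.shape, x_hat.shape)
```

In this sketch the transformation has no trainable parameters, and the pseudoinput holds 49 values instead of 784, which illustrates the sense in which such pseudoinputs are compact.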
Abstract
Deep hierarchical variational autoencoders (VAEs) are powerful latent variable generative models. In this paper, we introduce a Hierarchical VAE with a Diffusion-based Variational Mixture of the Posterior Prior (VampPrior). We apply amortization to scale the VampPrior to models with many stochastic layers. The proposed approach achieves better performance than the original VampPrior work and other deep hierarchical VAEs, while using fewer parameters. We empirically validate our method on standard benchmark datasets (MNIST, OMNIGLOT, CIFAR-10) and demonstrate improved training stability and latent space utilization.
