Ladder Variational Autoencoders

Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, Ole Winther

TL;DR

The Ladder Variational Autoencoder (LVAE) changes how inference is performed in deep hierarchical VAEs. Its inference model runs in two passes: a bottom-up pass computes a data-dependent approximate likelihood for each stochastic layer, and a top-down pass merges that estimate with the generative prior using precision weighting. Compared with purely bottom-up inference, this yields tighter log-likelihood lower bounds and better predictive log-likelihood, and it makes use of a deeper, more distributed hierarchy of latent variables. The authors report gains on MNIST, OMNIGLOT, and NORB, and find that batch normalization and a deterministic warm-up schedule are essential for training models with many stochastic layers. Because the change is confined to the inference model, LVAE can be combined with other advances such as normalizing flows or semi-supervised extensions.
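The precision-weighted merge mentioned above can be illustrated for a single Gaussian latent unit. This is a minimal sketch, not the authors' implementation: it assumes both the bottom-up estimate and the top-down prior are univariate Gaussians and combines them as a product of Gaussians, so the more certain source (higher precision, i.e. lower variance) dominates. The function and argument names are hypothetical.

```python
def precision_weighted_merge(mu_hat, var_hat, mu_p, var_p):
    """Merge a bottom-up Gaussian estimate with a top-down Gaussian prior.

    mu_hat, var_hat: mean and variance of the bottom-up estimate (assumed names).
    mu_p, var_p:     mean and variance of the top-down prior (assumed names).
    Returns the mean and variance of the merged Gaussian.
    """
    prec_hat = 1.0 / var_hat          # precision of the bottom-up estimate
    prec_p = 1.0 / var_p              # precision of the top-down prior
    var_q = 1.0 / (prec_hat + prec_p)  # precisions add
    mu_q = (mu_hat * prec_hat + mu_p * prec_p) * var_q  # precision-weighted mean
    return mu_q, var_q


# Equal variances: the merged mean is the midpoint and the variance halves.
mu_q, var_q = precision_weighted_merge(mu_hat=2.0, var_hat=1.0, mu_p=0.0, var_p=1.0)
print(mu_q, var_q)  # 1.0 0.5
```

When the bottom-up estimate is much more certain than the prior, the merged mean moves toward `mu_hat`; when it is uninformative, the merged distribution falls back to the prior.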

Abstract

Variational Autoencoders are powerful models for unsupervised learning. However, deep models with several layers of dependent stochastic variables are difficult to train, which limits the improvements obtained using these highly expressive models. We propose a new inference model, the Ladder Variational Autoencoder, that recursively corrects the generative distribution by a data-dependent approximate likelihood in a process resembling the recently proposed Ladder Network. We show that this model provides state-of-the-art predictive log-likelihood and a tighter log-likelihood lower bound compared to the purely bottom-up inference in layered Variational Autoencoders and other generative models. We provide a detailed analysis of the learned hierarchical latent representation and show that our new inference model is qualitatively different and utilizes a deeper, more distributed hierarchy of latent variables. Finally, we observe that batch normalization and deterministic warm-up (gradually turning on the KL-term) are crucial for training variational models with many stochastic layers.
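Deterministic warm-up, described above as gradually turning on the KL-term, can be sketched as a weight on the KL term that rises from 0 to 1 early in training. The linear shape, the `warmup_epochs` parameter, and its default value are assumptions made for illustration; the text above does not specify them.

```python
def kl_weight(epoch, warmup_epochs=200):
    """Weight on the KL term: rises linearly from 0 to 1, then stays at 1.

    warmup_epochs is a hypothetical hyperparameter, not a value from the text.
    """
    if warmup_epochs <= 0:
        return 1.0  # no warm-up: KL term fully on from the start
    return min(1.0, max(0.0, epoch / warmup_epochs))


def warmed_up_objective(recon_log_lik, kl, epoch, warmup_epochs=200):
    """Lower-bound objective with the KL term scaled by the warm-up weight."""
    return recon_log_lik - kl_weight(epoch, warmup_epochs) * kl


# Halfway through warm-up the KL term counts at half strength.
print(kl_weight(100))                          # 0.5
print(warmed_up_objective(-90.0, 20.0, 100))   # -100.0
```

With a weight of 0 the objective reduces to the reconstruction term alone; once the weight reaches 1 it is the usual variational lower bound.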

Paper Structure

This paper contains 10 sections, 8 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Inference (or encoder/recognition) and generative (or decoder) models for a) VAE and b) LVAE. Circles are stochastic variables and diamonds are deterministic variables.
  • Figure 2: MNIST train (full lines) and test (dashed lines) set log-likelihood using one importance sample during training. The LVAE improves performance significantly over the regular VAE.
  • Figure 3: MNIST log-likelihood values for VAEs and the LVAE model with different numbers of latent layers, batch normalization (BN), and warm-up (WU). a) Train log-likelihood, b) test log-likelihood and c) test log-likelihood with 5000 importance samples.
  • Figure 4: $\log KL(q|p)$ for each latent unit is shown at different training epochs. Low $KL$ (white) corresponds to an inactive unit. The units are sorted for visualization. It is clear that the vanilla VAE cannot train the higher latent layers, while introducing batch normalization helps. Warm-up creates more active units early in training, some of which are then gradually pruned away during training, resulting in a more distributed final representation. Lastly, we see that the LVAE activates the highest number of units in each layer.
  • Figure 5: Layer-wise $KL(q|p)$ divergence going from the lowest to the highest layers. In the VAE models the KL divergence is highest in the lowest layers, whereas it is more distributed in the LVAE model.
  • ...and 5 more figures
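
The per-unit and layer-wise KL values plotted in Figures 4 and 5 can be illustrated with the closed-form KL divergence between two univariate Gaussians. This is a minimal sketch that assumes diagonal Gaussian posteriors and priors for a single data point; how the values are aggregated over the dataset for the figures is not stated above, and the function names are hypothetical.

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for univariate Gaussians q = N(mu_q, sigma_q^2), p = N(mu_p, sigma_p^2)."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)


def layer_kl(q_params, p_params):
    """Sum of per-unit KL terms for one layer.

    q_params, p_params: lists of (mean, std) pairs, one pair per latent unit.
    """
    return sum(gaussian_kl(mq, sq, mp, sp)
               for (mq, sq), (mp, sp) in zip(q_params, p_params))


# A unit whose posterior equals its prior carries no information (KL = 0),
# which is what an inactive unit looks like in this picture.
print(gaussian_kl(0.0, 1.0, 0.0, 1.0))  # 0.0
print(gaussian_kl(1.0, 1.0, 0.0, 1.0))  # 0.5
```

Units whose KL stays near zero throughout training correspond to the inactive (white) units in Figure 4.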