Towards Deeper Understanding of Variational Autoencoding Models

Shengjia Zhao, Jiaming Song, Stefano Ermon

TL;DR

The paper tackles the limitations of standard variational autoencoders on complex datasets by introducing a general optimization framework that is not strictly tied to variational Bayes. It models p_theta(x,z)=p(z)p_theta(x|z) with a flexible conditional family P and a learnable inference distribution q, and it explains why simple Gaussian conditionals produce blurry samples. A key contribution is the sequential VAE, inspired by infusion training, which generates sharp samples on LSUN even with a pixel-wise L2 reconstruction loss. The paper also analyzes the information preference that arises with powerful decoders, where the latent code can be ignored, and proposes a criterion that keeps the latent code in use. Empirical results show sharper samples and informative latent representations.
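
The blur argument can be illustrated with a small numerical sketch. It relies on a general fact rather than on the paper's own derivation: the single output that minimizes expected squared error is the mean of the targets. The toy setting is an assumption made for illustration only: one latent code whose conditional distribution puts equal mass on two sharp 4-pixel "images" (`mode_a` and `mode_b` are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two equally likely sharp "images" that share one latent code z.
mode_a = np.array([1.0, 0.0, 1.0, 0.0])
mode_b = np.array([0.0, 1.0, 0.0, 1.0])
pick_a = rng.random((10_000, 1)) < 0.5
samples = np.where(pick_a, mode_a, mode_b)  # shape (10000, 4)


def l2_loss(candidate, xs):
    """Mean over samples of the summed squared pixel error."""
    return float(np.mean(np.sum((xs - candidate) ** 2, axis=1)))


# The L2-optimal single output for this z is the per-pixel mean of the samples.
mean_image = samples.mean(axis=0)

# Summed per-pixel variance, the kind of quantity plotted in Figure 2.
total_variance = float(samples.var(axis=0).sum())

print("mean image:", mean_image)
print("loss(mean):", l2_loss(mean_image, samples))
print("loss(mode_a):", l2_loss(mode_a, samples))
print("summed variance:", total_variance)
```

In this sketch the mean image has every pixel close to 0.5, so it resembles neither sharp mode, yet it achieves a lower L2 loss than either of them. Its loss equals the summed per-pixel variance, which is why high-variance regions of the latent space go together with fuzzy outputs.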

Abstract

We propose a new family of optimization criteria for variational auto-encoding models, generalizing the standard evidence lower bound. We provide conditions under which they recover the data distribution and learn latent features, and formally show that common issues such as blurry samples and uninformative latent features arise when these conditions are not met. Based on these new insights, we propose a new sequential VAE model that can generate sharp samples on the LSUN image dataset based on pixel-wise reconstruction loss, and propose an optimization criterion that encourages unsupervised learning of informative latent features.
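
For reference, the standard evidence lower bound that the abstract generalizes is commonly written as below. The symbols $q_\phi(z|x)$ for the inference distribution and $p_{\mathrm{data}}(x)$ for the data distribution are notational assumptions made here and may differ from the paper's own choices.

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{p_{\mathrm{data}}(x)}\Big[\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big)\Big] \le \mathbb{E}_{p_{\mathrm{data}}(x)}\big[\log p_\theta(x)\big]$$

The first term rewards reconstruction and the second term pulls the inference distribution toward the prior $p(z)$. The inequality holds because the gap between the two sides is the expected KL divergence between $q_\phi(z|x)$ and the true posterior $p_\theta(z|x)$, which is non-negative.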

Paper Structure

This paper contains 20 sections, 6 theorems, 46 equations, 7 figures.

Key Result

Proposition 1

Let $\theta^*$ be the global optimum of $\mathcal{L}$ defined in (equ:model_with_q), and $f_{\theta^*}$ the corresponding optimal mapping. If $\mathcal{F}$ has sufficient capacity, then for every $z \in \mathcal{Z}$

Figures (7)

  • Figure 1: Illustration of variational approximation of $q(x|z)$ by $\mathcal{P}$. Left: for each $z \in \mathcal{Z}$ we use the optimal member of $\mathcal{P}$ to approximate $q(x|z)$. Right: this approximation requires $\mathcal{P}$ to be large enough so that it covers the true posterior $q(x|z)$ for any $z$.
  • Figure 2: $\sum_{i} \mathrm{Var}_{q_\phi(x|z)}[x_i]$ plotted over the latent space; red corresponds to high variance and blue to low variance. The plotted digits are the generated $g_\theta(z)$ at each $z$. Digits in high-variance regions are fuzzy, while digits in low-variance regions are generally well generated. (Best viewed on screen)
  • Figure 3: Infusion Training (Left) vs. Sequential VAE (Right). For Infusion Training, at each step some random pixels from real data are added to the previous reconstruction. Based on the newly added pixels the model makes a new attempt at reconstruction. Sequential VAE is a generalization of this idea. At each step some features are extracted from real data. The network makes a new attempt at reconstruction based on previous results and the new information.
  • Figure 4: Sequential VAE on CelebA and LSUN. Each column corresponds to a step in the sequence (starting from noise); in particular, the second is what a regular VAE with the same architecture generates. We see increasingly sharp images and addition of details with more iterations (from left to right).
  • Figure 5: Mutual information vs. sample quality for a VAE with PixelCNN as the family $\mathcal{P}$. Top row: Pixel VAE optimized on the ELBO. Bottom row: Pixel VAE optimized without regularization. For the ELBO, ancestral sampling from $p(z)p_\theta(x|z)$ (Left) produces samples of similar quality to the Markov chain (Middle). For the unregularized VAE, ancestral sampling produces nonsensical samples, while the Markov chain produces samples of similar quality to the ELBO model. Right: evolution of the estimated mutual information and the per-pixel negative log-likelihood loss. For the ELBO, mutual information is driven to zero, indicating an unused latent code, while without regularization large mutual information is preferred. Details on the mutual information approximation are in the Appendix.
  • ...and 2 more figures
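
The link between mutual information and latent usage described in the Figure 5 caption can be made concrete with a minimal sketch. It computes the exact mutual information of a small discrete joint table, which is not the approximation the paper uses (that one is described in its Appendix). The two 2x2 tables and the function name `mutual_information` are hypothetical and chosen only for illustration.

```python
import numpy as np


def mutual_information(joint):
    """Exact I(X; Z) in nats for a discrete joint table indexed as joint[x, z]."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()              # normalize to a probability table
    px = joint.sum(axis=1, keepdims=True)    # marginal over z, shape (n_x, 1)
    pz = joint.sum(axis=0, keepdims=True)    # marginal over x, shape (1, n_z)
    independent = px @ pz                    # product of marginals, shape (n_x, n_z)
    mask = joint > 0                         # skip zero cells, where 0 * log 0 = 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / independent[mask])))


# Hypothetical case 1: the decoder ignores z, so x and z are independent.
latent_ignored = np.outer([0.5, 0.5], [0.5, 0.5])

# Hypothetical case 2: z fully determines x.
latent_used = np.array([[0.5, 0.0],
                        [0.0, 0.5]])

print("I(x; z), latent ignored:", mutual_information(latent_ignored))
print("I(x; z), latent used:   ", mutual_information(latent_used))
```

When the decoder ignores the latent code, the joint factorizes and the mutual information is zero. When the code determines the output, the mutual information reaches the entropy of the code, which is log 2 nats for a fair binary code.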

Theorems & Definitions (10)

  • Definition 1
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Proposition 5
  • Proposition 6
  • Proof of Propositions \ref{prop:optimal_solution}, \ref{prop:condition}, and \ref{prop:marginal_condition}
  • Proof of Proposition \ref{prop:reconstruction_error}
  • Proof of Proposition \ref{prop:discrete_optimum}