Towards Deeper Understanding of Variational Autoencoding Models
Shengjia Zhao, Jiaming Song, Stefano Ermon
TL;DR
The paper addresses the limitations of standard variational autoencoders on complex datasets by introducing a general family of optimization criteria that is not strictly tied to variational Bayes. It models p_theta(x,z) = p(z) p_theta(x|z) with a flexible conditional family P and a learnable inference distribution q, gives conditions under which the model learns informative latent features, and explains why simple Gaussian conditionals produce blurry samples. A key contribution is a sequential VAE with infusion-inspired latent augmentation that generates sharp samples on the LSUN image dataset even with a pixel-wise L2 reconstruction loss. The paper also analyzes the information preference problem, in which a powerful decoder can lead the model to ignore the latent code, and proposes strategies that keep the latent code in use. Empirical results show sharper samples and meaningful latent representations, offering a principled path to better unsupervised feature learning in deep generative models.
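A toy illustration of the blur effect, offered as an intuition rather than as the paper's formal argument: with a fixed-variance Gaussian conditional, maximizing the log-likelihood amounts to minimizing squared error, and the squared-error optimum is the conditional mean. If the latent code does not determine whether a pixel is dark or bright, the best L2 prediction is an intermediate gray. The one-dimensional "pixel" and the helper names below are hypothetical and chosen only for this sketch.

```python
import random


def l2_optimal_constant(samples):
    """Return the constant minimizing mean squared error: the sample mean."""
    return sum(samples) / len(samples)


def mean_squared_error(samples, prediction):
    """Average squared distance between each sample and a single prediction."""
    return sum((s - prediction) ** 2 for s in samples) / len(samples)


random.seed(0)
# Hypothetical pixel that is dark (0.0) or bright (1.0) with equal probability,
# conditioned on a latent code that does not say which of the two it is.
pixels = [random.choice([0.0, 1.0]) for _ in range(10_000)]

blurry = l2_optimal_constant(pixels)           # lands near 0.5: a gray value
mse_at_mean = mean_squared_error(pixels, blurry)
mse_at_mode = mean_squared_error(pixels, 1.0)  # a sharp but often wrong guess

print(f"L2-optimal prediction: {blurry:.3f}")
print(f"MSE at the mean: {mse_at_mean:.3f}, MSE at a sharp mode: {mse_at_mode:.3f}")
```

The gray prediction has lower squared error than either sharp value, which is why a pixel-wise L2 loss rewards averaging over modes unless the latent code carries enough information to pick one.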
Abstract
We propose a new family of optimization criteria for variational auto-encoding models, generalizing the standard evidence lower bound. We provide conditions under which they recover the data distribution and learn latent features, and formally show that common issues such as blurry samples and uninformative latent features arise when these conditions are not met. Based on these new insights, we propose a new sequential VAE model that can generate sharp samples on the LSUN image dataset based on pixel-wise reconstruction loss, and propose an optimization criterion that encourages unsupervised learning of informative latent features.
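For reference, the standard evidence lower bound that the proposed criteria generalize can be written as below. This is the textbook form and not the paper's generalized objective. The parameter subscript \phi on the inference distribution is a notational assumption; the summary above writes it simply as q.

```latex
\log p_\theta(x) \;\ge\; \mathcal{L}_{\mathrm{ELBO}}(x)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
  - \mathrm{KL}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```

The first term rewards accurate reconstruction of x from z, and the second keeps the inference distribution close to the prior p(z).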
