Diagnosing and Enhancing VAE Models

Bin Dai, David Wipf

TL;DR

The paper challenges the view that Gaussian encoder/decoder choices inherently limit VAE performance. Writing $r$ for the dimension of the data manifold and $d$ for the ambient dimension, it shows that, in principle, the ground-truth distribution can be recovered at optimality in the $r=d$ regime, and that in the $r<d$ regime optimal solutions yield near-correct manifold mass but are non-unique. It then introduces a simple, practical two-stage VAE enhancement that first learns a low-dimensional manifold and then learns the distribution on that manifold, producing crisp samples and Fréchet Inception Distance (FID) scores competitive with some GANs under a neutral architecture and without extra hyperparameters. Through theoretical results and extensive experiments on MNIST, Fashion-MNIST, CIFAR-10, and CelebA, the work demonstrates that this two-stage approach reduces the mismatch between the aggregated posterior and a standard Gaussian, yields stable sampling, and remains robust to latent-dimension choices. Altogether, the method provides a principled route to improving VAE-based generative modeling, narrowing the gap to GANs while preserving VAE advantages such as stable training and interpretable inference.

Abstract

Although variational autoencoders (VAEs) represent a widely influential deep generative model, many aspects of the underlying energy function remain poorly understood. In particular, it is commonly believed that Gaussian encoder/decoder assumptions reduce the effectiveness of VAEs in generating realistic samples. In this regard, we rigorously analyze the VAE objective, differentiating situations where this belief is and is not actually true. We then leverage the corresponding insights to develop a simple VAE enhancement that requires no additional hyperparameters or sensitive tuning. Quantitatively, this proposal produces crisp samples and stable FID scores that are actually competitive with a variety of GAN models, all while retaining desirable attributes of the original VAE architecture. A shorter version of this work will appear in the ICLR 2019 conference proceedings (Dai and Wipf, 2019). The code for our model is available at https://github.com/daib13/TwoStageVAE.
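As a rough illustration of the two-stage idea described in the TL;DR, the sketch below draws samples by chaining two decoders: a second-stage decoder maps standard Gaussian noise $u$ to first-stage latent codes $z$, and a first-stage decoder maps $z$ to data space. This is only a sketch under stated assumptions. The decoders here are hypothetical random affine maps standing in for trained networks, the names `make_linear_decoder` and `two_stage_sample` are invented for illustration, and decoder noise is omitted. It is not the authors' implementation.

```python
import numpy as np


def make_linear_decoder(in_dim, out_dim, rng):
    """Hypothetical stand-in for a trained decoder: a fixed random affine map."""
    weights = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    bias = rng.standard_normal(out_dim)
    return lambda h: h @ weights + bias


def two_stage_sample(n, latent_dim, decode_u_to_z, decode_z_to_x, rng):
    """Chain the two stages: u ~ N(0, I) -> z -> x."""
    u = rng.standard_normal((n, latent_dim))  # second-stage latent, standard Gaussian
    z = decode_u_to_z(u)                      # second-stage decoder output (first-stage latent)
    x = decode_z_to_x(z)                      # first-stage decoder output (data space)
    return u, z, x


rng = np.random.default_rng(0)
latent_dim, data_dim = 8, 32  # illustrative sizes, not taken from the paper

second_stage_decoder = make_linear_decoder(latent_dim, latent_dim, rng)
first_stage_decoder = make_linear_decoder(latent_dim, data_dim, rng)

u, z, x = two_stage_sample(5, latent_dim, second_stage_decoder, first_stage_decoder, rng)
print(u.shape, z.shape, x.shape)
```

The design point the sketch tries to convey is that noise is never fed directly into the first-stage decoder: the second stage sits in between, so that the codes the first-stage decoder receives resemble those it saw during training.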

Paper Structure

This paper contains 30 sections, 4 theorems, 69 equations, 19 figures, 4 tables.

Key Result

Theorem 2

Suppose that $r = d$ and there exists a density $p_{gt}(\boldsymbol{x})$ associated with the ground-truth measure $\mu_{gt}$ that is nonzero everywhere on $\mathbb{R}^d$. This nonzero assumption can be replaced with a much looser condition. Specifically, if there exists a diffeomorphism between

Figures (19)

  • Figure 1: Validation of Theorem \ref{thm:decoder_variance}. (a) The red line shows the evolution of $\log\gamma$, converging close to $0$ during training as expected. The two blue curves compare the associated pixel-wise reconstruction errors with $\gamma$ fixed at $1$ and with a learnable $\gamma$ respectively. (b) The FID score obtained using reconstructed images from various VAE models (reconstructed image FID is another way of evaluating reconstruction quality; it is distinct from measuring generated sample quality via FID scores). In general, the VAE with learnable $\gamma$ produces the best reconstructions as expected.
  • Figure 2: Validation of Theorem \ref{thm:decoder_mean}. The $j$-th eigenvalue of $\boldsymbol{\Sigma}_z$, denoted $\lambda_j$, should be very close to either $0$ or $1$ as argued in Section \ref{sec:optima_property}. When $\lambda_j$ is close to $0$, injecting noise along the corresponding direction will cause a large variance in the reconstructed image, meaning this direction is an informative one needed for representing the manifold. In contrast, if $\lambda_j$ is close to $1$, the addition of noise does not make any appreciable difference in the reconstructed image, indicating that the corresponding dimension is a superfluous one that has been ignored/blocked by the decoder.
  • Figure 3: Singular value spectra of latent sample matrices drawn from $q_\phi(\boldsymbol{z})$ (first stage) and $q_{\phi^\prime}(\boldsymbol{u})$ (enhanced second stage).
  • Figure 4: Maximum mean discrepancy between $\mathcal{N}(0,\boldsymbol{I})$ and $q_\phi(\boldsymbol{z})$ (first stage); likewise for $q_{\phi^\prime}(\boldsymbol{u})$ (second stage).
  • Figure 5: FID Score w.r.t. Different Latent Dimensions. (Left) Reconstruction FID. (Right) Generation FID.
  • ...and 14 more figures
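
The diagnostics named in Figures 3 and 4 can be sketched in a few lines. The code below computes a normalized singular value spectrum of a latent sample matrix and a biased (V-statistic) estimate of the squared maximum mean discrepancy against samples from $\mathcal{N}(0, I)$. The RBF kernel, the bandwidth, the sample sizes, and the synthetic "mismatched" latents are all assumptions made for illustration. The kernel and settings used for the paper's figures are not stated in this summary.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth):
    """Gaussian RBF kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two sample sets."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


def normalized_singular_values(samples):
    """Singular values of the sample matrix divided by sqrt(n).

    For samples from N(0, I) these are all close to 1.
    """
    n = samples.shape[0]
    return np.linalg.svd(samples, compute_uv=False) / np.sqrt(n)


rng = np.random.default_rng(0)
n, dim = 500, 4  # illustrative sizes

prior = rng.standard_normal((n, dim))          # reference samples from N(0, I)
matched = rng.standard_normal((n, dim))        # latents that do match the prior
mismatched = 3.0 * rng.standard_normal((n, dim)) * np.array([1.0, 1.0, 0.1, 0.1])

mmd_matched = mmd2_biased(matched, prior, bandwidth=2.0)
mmd_mismatched = mmd2_biased(mismatched, prior, bandwidth=2.0)
spectrum_matched = normalized_singular_values(matched)
spectrum_mismatched = normalized_singular_values(mismatched)

print("MMD^2 matched vs prior:   ", mmd_matched)
print("MMD^2 mismatched vs prior:", mmd_mismatched)
print("spectrum (matched):   ", spectrum_matched)
print("spectrum (mismatched):", spectrum_mismatched)
```

In this toy setup the matched latents give a small discrepancy and a flat spectrum near 1, while the anisotropic latents give a larger discrepancy and a spread-out spectrum, which is the kind of contrast the two figures are described as showing between the first and second stages.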

Theorems & Definitions (5)

  • Definition 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5