Zero-Variance Gradients for Variational Autoencoders

Zilei Shao, Anji Liu, Guy Van den Broeck

TL;DR

This paper introduces a training paradigm that uses the analytic gradient to guide early encoder learning before annealing to a standard stochastic estimator, and suggests that architectural choices enabling analytic expectation computation can significantly stabilize the training of generative models with stochastic components.

Abstract

Training deep generative models like Variational Autoencoders (VAEs) requires propagating gradients through stochastic latent variables, which introduces estimation variance that can slow convergence and degrade performance. In this paper, we explore an orthogonal direction, which we call Silent Gradients. Instead of designing improved stochastic estimators, we show that by restricting the decoder architecture in specific ways, the expected ELBO can be computed analytically. This yields gradients with zero estimation variance as we can directly compute the evidence lower-bound without resorting to Monte Carlo samples of the latent variables. We first provide a theoretical analysis in a controlled setting with a linear decoder and demonstrate improved optimization compared to standard estimators. To extend this idea to expressive nonlinear decoders, we introduce a training paradigm that uses the analytic gradient to guide early encoder learning before annealing to a standard stochastic estimator. Across multiple datasets, our approach consistently improves established baselines, including reparameterization, Gumbel-Softmax, and REINFORCE. These results suggest that architectural choices enabling analytic expectation computation can significantly stabilize the training of generative models with stochastic components.
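
The claim that a restricted decoder makes the expected ELBO analytic can be illustrated on its reconstruction term. The sketch below is a minimal example, not the authors' implementation. It assumes a diagonal Gaussian posterior $q(\boldsymbol{z}|\boldsymbol{x}) = \mathcal{N}(\boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^2))$, a linear decoder $W\boldsymbol{z} + \boldsymbol{b}$, and a squared-error reconstruction loss (a fixed-variance Gaussian likelihood, up to constants). All function and variable names are hypothetical. Under these assumptions, $\mathbb{E}_q\|\boldsymbol{x} - W\boldsymbol{z} - \boldsymbol{b}\|^2 = \|\boldsymbol{x} - W\boldsymbol{\mu} - \boldsymbol{b}\|^2 + \sum_j \sigma_j^2 \|W_{:,j}\|^2$, so no samples of $\boldsymbol{z}$ are needed.

```python
import numpy as np

def analytic_expected_sq_error(x, W, b, mu, sigma2):
    """E_q ||x - (W z + b)||^2 for z ~ N(mu, diag(sigma2)), in closed form."""
    residual = x - (W @ mu + b)
    # Variance contribution: sum_j sigma2_j * ||W[:, j]||^2 = trace(W diag(sigma2) W^T)
    variance_term = np.sum(sigma2 * np.sum(W**2, axis=0))
    return residual @ residual + variance_term

def monte_carlo_sq_error(x, W, b, mu, sigma2, n_samples, rng):
    """Sample-based estimate of the same expectation (reparameterized draws)."""
    z = mu + np.sqrt(sigma2) * rng.standard_normal((n_samples, mu.shape[0]))
    recon = z @ W.T + b
    return np.mean(np.sum((x - recon) ** 2, axis=1))

# Toy problem with arbitrary (assumed) dimensions.
rng = np.random.default_rng(0)
d_x, d_z = 6, 3
x = rng.standard_normal(d_x)
W = rng.standard_normal((d_x, d_z))
b = rng.standard_normal(d_x)
mu = rng.standard_normal(d_z)
sigma2 = rng.uniform(0.1, 1.0, size=d_z)

exact = analytic_expected_sq_error(x, W, b, mu, sigma2)
estimate = monte_carlo_sq_error(x, W, b, mu, sigma2, 200_000, rng)
print(exact, estimate)  # the two values should agree closely
```

Because `exact` is a deterministic function of `mu`, `sigma2`, `W`, and `b`, its gradient with respect to the encoder outputs carries no sampling noise, whereas the Monte Carlo value changes with every draw.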

Paper Structure

This paper contains 40 sections, 2 theorems, 41 equations, 5 figures, 9 tables, 1 algorithm.

Key Result

Proposition 1

Let $\boldsymbol{z}\in \mathbb{R}^d$ be a random vector with independent components $z_i$. The first four central moments of each component, $\mathbb{E}[\tilde{z}_i^k]:=\mathbb{E}[(z_i-\mathbb{E}[z_i])^k]$ for $k\in\{1,2,3,4\}$, can be computed in closed form from the parameters of its distribution if ...
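
As a concrete illustration of closed-form central moments, the sketch below uses two families chosen here as assumptions: a Gaussian and a Bernoulli component. The condition of the proposition is cut off in this excerpt, so these are not necessarily the families it names. For $z_i \sim \mathcal{N}(\mu, \sigma^2)$ the first four central moments are $0, \sigma^2, 0, 3\sigma^4$; for $z_i \sim \mathrm{Bernoulli}(p)$ with $v = p(1-p)$ they are $0, v, v(1-2p), v(1-3v)$. The function names are hypothetical.

```python
import numpy as np

def gaussian_central_moments(sigma2):
    """Central moments k = 1..4 of N(mu, sigma2); they do not depend on mu."""
    return np.array([0.0, sigma2, 0.0, 3.0 * sigma2**2])

def bernoulli_central_moments(p):
    """Central moments k = 1..4 of Bernoulli(p), written in terms of v = p(1-p)."""
    v = p * (1.0 - p)
    return np.array([0.0, v, v * (1.0 - 2.0 * p), v * (1.0 - 3.0 * v)])

def bernoulli_central_moments_by_enumeration(p):
    """Same quantities computed directly from the two-point distribution."""
    values = np.array([0.0, 1.0])
    probs = np.array([1.0 - p, p])
    mean = probs @ values
    return np.array([probs @ (values - mean) ** k for k in range(1, 5)])

print(gaussian_central_moments(0.5))
print(bernoulli_central_moments(0.3))
print(bernoulli_central_moments_by_enumeration(0.3))
```

The enumeration function is only a cross-check for the Bernoulli formulas; in training, the closed forms would be evaluated directly on the encoder's output parameters.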

Figures (5)

  • Figure 1: Illustration of the use of Silent Gradients in training VAEs. The encoder ($E_\phi$) takes input $\boldsymbol{x}$ and infers a latent distribution $q_\phi(\boldsymbol{z}|\boldsymbol{x})$. These parameters are fed directly to the linear decoder ($D_{lin}$), which computes the analytical reconstruction log-likelihood, yielding a noise-free (Silent) gradient (dashed teal arrow) used to train the encoder. In parallel, samples $\boldsymbol{z}'$ are drawn from the latent distribution and fed to the nonlinear decoder ($D_{nl}$), which produces a standard, sample-based loss, resulting in a noisy gradient (dashed orange arrow). The solid black arrows represent the forward pass, while the dashed teal and orange arrows indicate the flow of gradients. During training, we can choose to train the encoder solely with the Silent Gradients or combine it with the noisy gradient using an annealing schedule. At inference time, only the trained encoder $E_\phi$ and nonlinear decoder $D_{nl}$ are used.
  • Figure 2: Visual comparison of reconstructions for the fixed variance experiment on MNIST. The top row (a) displays original images from the validation set. Subsequent rows show the reconstructed means from our Silent Gradients method and the baseline estimators for both continuous and discrete latent spaces.
  • Figure 3: Reconstructions on the MNIST dataset in the learnable variance experiment. The images are the output of the nonlinear decoder.
  • Figure 4: Reconstructions on the ImageNet dataset in the learnable variance experiment.
  • Figure 5: Reconstructions on the CIFAR-10 dataset in the learnable variance experiment.
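
The Figure 1 caption mentions combining the Silent Gradient with the noisy gradient using an annealing schedule, but this excerpt does not specify the form of that schedule. The sketch below is therefore purely illustrative: it assumes a linear decay of the weight on the analytic gradient and a convex combination of the two encoder gradients. The names `silent_weight`, `mix_encoder_gradients`, and `anneal_steps` are hypothetical.

```python
import numpy as np

def silent_weight(step, anneal_steps):
    """Assumed linear schedule: 1.0 at step 0, decaying to 0.0 at anneal_steps."""
    if anneal_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / anneal_steps)

def mix_encoder_gradients(grad_silent, grad_noisy, step, anneal_steps):
    """Convex combination of the analytic and the sample-based encoder gradient."""
    lam = silent_weight(step, anneal_steps)
    return lam * grad_silent + (1.0 - lam) * grad_noisy

# Toy gradients standing in for the two paths of Figure 1.
g_silent = np.array([1.0, -2.0])   # from the linear decoder (noise-free)
g_noisy = np.array([3.0, 0.0])     # from the nonlinear decoder (sample-based)

early = mix_encoder_gradients(g_silent, g_noisy, step=0, anneal_steps=1000)
late = mix_encoder_gradients(g_silent, g_noisy, step=1000, anneal_steps=1000)
print(early, late)
```

Under this assumed schedule, the encoder is driven entirely by the analytic gradient at the start of training and entirely by the stochastic estimator once `step` reaches `anneal_steps`.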

Theorems & Definitions (4)

  • Proposition 1: Tractable Central Moments
  • Proof: Derivation Sketch
  • Theorem 1: Analytic Covariance of Linear Projections
  • Proof: Proof Sketch
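
The statement of Theorem 1 is not reproduced in this excerpt. As background for its title, the standard identity for a linear projection of a vector with independent components is $\mathrm{Cov}(W\boldsymbol{z} + \boldsymbol{b}) = W\,\mathrm{diag}(\mathrm{Var}[z_1], \dots, \mathrm{Var}[z_d])\,W^\top$, which holds for any component distributions with finite variance. The sketch below checks this identity numerically; it is an assumption that the theorem builds on this form, and the function name is hypothetical.

```python
import numpy as np

def linear_projection_covariance(W, var_z):
    """Cov(W z + b) when the components of z are independent with variances var_z."""
    # Scaling column j of W by var_z[j] is the same as W @ diag(var_z).
    return (W * var_z) @ W.T

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
var_z = np.array([0.5, 1.0, 2.0])
cov_analytic = linear_projection_covariance(W, var_z)

# Sample-based check with non-Gaussian latents: uniform draws with unit variance,
# rescaled so that component j has variance var_z[j].
u = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(500_000, 3))
z = u * np.sqrt(var_z)
cov_empirical = np.cov(z @ W.T, rowvar=False)

print(np.max(np.abs(cov_analytic - cov_empirical)))  # should be small
```

The offset $\boldsymbol{b}$ is omitted from the check because adding a constant does not change the covariance.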