Backpropagation through the Void: Optimizing control variates for black-box gradient estimation

Will Grathwohl, Dami Choi, Yuhuai Wu, Geoffrey Roeder, David Duvenaud

TL;DR

The paper introduces LAX and RELAX, gradient estimators that use a neural-network surrogate as a learned control variate to give unbiased, low-variance gradients of black-box objectives over random variables. They combine the score-function estimator, the reparameterization trick, and a differentiable control variate. LAX applies to continuous variables; RELAX handles discrete variables through a continuous relaxation (Gumbel-softmax) with conditional reparameterization. The framework also enables action-dependent baselines in reinforcement learning. The approach is demonstrated on toy problems, discrete VAEs, and RL benchmarks, where it shows faster convergence and lower gradient variance than standard baselines. It broadens the applicability of gradient-based optimization to non-differentiable or unknown objectives and offers practical improvements for training models with discrete latent variables and complex controllers.

Abstract

Gradient-based optimization is the foundation of deep learning and reinforcement learning. Even when the mechanism being optimized is unknown or not differentiable, optimization using high-variance or biased gradient estimates is still often the best strategy. We introduce a general framework for learning low-variance, unbiased gradient estimators for black-box functions of random variables. Our method uses gradients of a neural network trained jointly with model parameters or policies, and is applicable in both discrete and continuous settings. We demonstrate this framework for training discrete latent-variable models. We also give an unbiased, action-conditional extension of the advantage actor-critic reinforcement learning algorithm.

Paper Structure

This paper contains 33 sections, 2 theorems, 27 equations, 7 figures, 4 tables, 2 algorithms.

Key Result

Theorem C.1

The LAX estimator is unbiased.
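
For context, here is a sketch of the estimator the theorem refers to and of why it is unbiased. This summary does not reproduce the paper's equations, so the notation is an assumption: $f$ is the black-box objective, $c_\phi$ the learned surrogate, and $b = T(\theta, \epsilon)$ with $\epsilon \sim p(\epsilon)$ a reparameterized sample from $p(b|\theta)$.

$$\hat{g}_{\text{LAX}} = \left[f(b) - c_\phi(b)\right] \frac{\partial}{\partial\theta} \log p(b|\theta) + \frac{\partial}{\partial\theta} c_\phi(b), \qquad b = T(\theta, \epsilon),\ \epsilon \sim p(\epsilon)$$

In the first term $b$ is held fixed when differentiating $\log p(b|\theta)$; in the second term the derivative flows through $b = T(\theta, \epsilon)$. Taking expectations, the score-function part gives $\frac{\partial}{\partial\theta}\mathbb{E}[f(b)] - \frac{\partial}{\partial\theta}\mathbb{E}[c_\phi(b)]$ and the reparameterization part gives $\frac{\partial}{\partial\theta}\mathbb{E}[c_\phi(b)]$, so the surrogate terms cancel:

$$\mathbb{E}\left[\hat{g}_{\text{LAX}}\right] = \frac{\partial}{\partial\theta}\mathbb{E}_{p(b|\theta)}[f(b)] - \frac{\partial}{\partial\theta}\mathbb{E}_{p(b|\theta)}[c_\phi(b)] + \frac{\partial}{\partial\theta}\mathbb{E}_{p(b|\theta)}[c_\phi(b)] = \frac{\partial}{\partial\theta}\mathbb{E}_{p(b|\theta)}[f(b)]$$

This holds for any choice of $\phi$, which is what allows $c_\phi$ to be trained freely to reduce variance.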

Figures (7)

  • Figure 1: Left: Training curves comparing different gradient estimators on a toy problem: $\mathcal{L}(\theta) = \mathbb{E}_{p(b|\theta)} [ (b - 0.499)^2 ]$. Right: Log-variance of each estimator's gradient.
  • Figure 2: Histograms of samples from the gradient estimators that create LAX. Samples generated from our one-layer VAE experiments (see the VAE section).
  • Figure 3: The optimal relaxation for a toy loss function, using different gradient estimators. Because REBAR uses the concrete relaxation of $f$, which happens to be implemented as a quadratic function, the optimal relaxation is constrained to be a warped quadratic. In contrast, RELAX can choose a free-form relaxation.
  • Figure 4: Training curves for the VAE Experiments with the one-layer linear model. The horizontal dashed line indicates the lowest validation error obtained by REBAR.
  • Figure 5: Top row: Reward curves. Bottom row: Log-variance of policy gradients. In each curve, the center line indicates the mean reward over 5 random seeds. The opaque bars in the top row indicate the 25th and 75th percentiles. The opaque bars in the bottom row indicate 1 standard deviation. Since the gradient estimator is defined at the end of each episode, we display log-variance per episode. After every 10th training episode, 100 episodes were run, and the sample log-variance, averaged over all policy parameters, is reported.
  • ...and 2 more figures
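
The toy objective in Figure 1 gives a small, concrete view of why control variates matter. The sketch below is not LAX or RELAX: it assumes $b \sim \text{Bernoulli}(\theta)$ and compares the plain score-function (REINFORCE) estimator with the same estimator using a fixed constant baseline, the simplest possible control variate. All function names are hypothetical and chosen for illustration only.

```python
import random
import statistics


def f(b, t=0.499):
    """Toy black-box objective from Figure 1: (b - t)^2."""
    return (b - t) ** 2


def exact_grad(t=0.499):
    # L(theta) = theta * f(1) + (1 - theta) * f(0), so dL/dtheta = f(1) - f(0).
    return f(1.0, t) - f(0.0, t)


def score(b, theta):
    # d/dtheta log p(b | theta) for b ~ Bernoulli(theta) (assumed distribution).
    return 1.0 / theta if b == 1.0 else -1.0 / (1.0 - theta)


def reinforce_samples(theta, n, baseline=0.0, seed=0):
    """Draw n single-sample score-function gradient estimates.

    Subtracting a constant baseline keeps the estimator unbiased because
    the score has zero mean under p(b | theta).
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        b = 1.0 if rng.random() < theta else 0.0
        samples.append((f(b) - baseline) * score(b, theta))
    return samples


if __name__ == "__main__":
    theta = 0.5
    plain = reinforce_samples(theta, 20000)
    with_cv = reinforce_samples(theta, 20000, baseline=0.25)
    print("exact gradient:        ", exact_grad())
    print("plain mean / variance: ", statistics.fmean(plain), statistics.pvariance(plain))
    print("with baseline mean/var:", statistics.fmean(with_cv), statistics.pvariance(with_cv))
```

Both estimators target the same gradient, but because $f(0)$ and $f(1)$ are both close to 0.25, subtracting that constant removes almost all of the variance. LAX and RELAX replace the hand-picked constant with a learned, input-dependent surrogate.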

Theorems & Definitions (5)

  • proof
  • Theorem C.1
  • proof
  • Theorem C.2
  • proof