Backpropagation through the Void: Optimizing control variates for black-box gradient estimation
Will Grathwohl, Dami Choi, Yuhuai Wu, Geoffrey Roeder, David Duvenaud
TL;DR
The paper introduces LAX and RELAX, two gradient estimators that use a neural-network surrogate as a control variate to obtain unbiased, low-variance gradients of expectations of black-box functions of random variables. Both combine the score-function estimator with the reparameterization trick, applied to the differentiable surrogate rather than to the objective itself. LAX covers continuous variables; RELAX covers discrete variables through a continuous relaxation with conditional reparameterization (in the style of Gumbel-softmax), and the same construction enables action-dependent baselines in reinforcement learning. The methods are evaluated on toy problems, discrete VAEs, and RL benchmarks, where they converge faster and show lower gradient variance than standard baselines. The framework makes gradient-based optimization applicable to non-differentiable or unknown objectives, with practical benefits for training models with discrete latent variables and for policy learning.
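To make the combination of score-function estimator, reparameterization, and control variate concrete, here is a minimal NumPy sketch of a LAX-style estimator on a one-dimensional Gaussian. It is an illustration under stated assumptions, not the authors' implementation: the objective `f`, the quadratic surrogate with hand-picked parameters `phi`, and the function name `lax_gradient` are all hypothetical, and the surrogate is fixed here rather than being a neural network trained jointly with `theta`.

```python
import numpy as np

def f(z):
    # Hypothetical stand-in for a black-box objective.
    # Only its values are used below, never its gradient.
    return (z - 2.0) ** 2

def lax_gradient(theta, phi, eps):
    """Single-sample estimates of d/dtheta E[f(z)] for z ~ N(theta, 1).

    phi = (a, b, c0) parameterizes a quadratic surrogate
    c_phi(z) = a*z**2 + b*z + c0 (an assumption made for this sketch).
    """
    a, b, c0 = phi
    z = theta + eps                  # reparameterized sample, dz/dtheta = 1
    score = z - theta                # d/dtheta log N(z; theta, 1)
    c = a * z**2 + b * z + c0        # surrogate value c_phi(z)
    dc_dtheta = 2.0 * a * z + b      # pathwise gradient of the surrogate
    # Score-function term on the residual, plus the reparameterized
    # gradient of the surrogate, which restores the expectation.
    return (f(z) - c) * score + dc_dtheta

rng = np.random.default_rng(0)
theta = 0.5
eps = rng.standard_normal(200_000)

# Surrogate switched off: reduces to the plain score-function estimator.
g_reinforce = lax_gradient(theta, (0.0, 0.0, 0.0), eps)
# Surrogate equal to f: the residual vanishes and only the pathwise term remains.
g_lax = lax_gradient(theta, (1.0, -4.0, 4.0), eps)

# Analytic gradient for this toy problem: E[f(z)] = (theta - 2)^2 + 1.
true_grad = 2.0 * (theta - 2.0)

print("true gradient     :", true_grad)
print("score-function    : mean", g_reinforce.mean(), "var", g_reinforce.var())
print("with control var. : mean", g_lax.mean(), "var", g_lax.var())
```

Both estimators target the same gradient for any choice of `phi`; only the variance changes. In this sketch the well-matched surrogate is supplied by hand, whereas the paper's approach is to learn the surrogate's parameters by minimizing the variance of the gradient estimate.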
Abstract
Gradient-based optimization is the foundation of deep learning and reinforcement learning. Even when the mechanism being optimized is unknown or not differentiable, optimization using high-variance or biased gradient estimates is still often the best strategy. We introduce a general framework for learning low-variance, unbiased gradient estimators for black-box functions of random variables. Our method uses gradients of a neural network trained jointly with model parameters or policies, and is applicable in both discrete and continuous settings. We demonstrate this framework for training discrete latent-variable models. We also give an unbiased, action-conditional extension of the advantage actor-critic reinforcement learning algorithm.
