Pathwise Gradient Variance Reduction with Control Variates in Variational Inference
Kenyon Ng, Susan Wei
TL;DR
This paper analyzes variance reduction for pathwise gradient estimators in variational inference (VI), surveying existing control variate (CV) approaches and introducing zero-variance control variates (ZVCV) to relax assumptions on the variational distribution. While ZVCV can be used with complex reparameterizable models whose mean and covariance are intractable, empirical results show that CV-based variance reduction rarely justifies its computational overhead in standard VI tasks; simply increasing the number of gradient samples often yields faster convergence and comparable ELBO improvements. The study highlights that gradient-variance reduction alone may not translate to better downstream metrics, motivating future exploration of ZVCV in generative or energy-based/implicit VI settings. Overall, the work clarifies when CV methods help and when they do not, and provides practical guidance on the trade-offs involved in variance reduction for VI.
Abstract
Variational inference in Bayesian deep learning often involves computing the gradient of an expectation that lacks a closed-form solution. In these cases, pathwise and score-function gradient estimators are the most common approaches. The pathwise estimator is often favoured for its substantially lower variance compared to the score-function estimator, which typically requires variance reduction techniques. However, recent research suggests that even pathwise gradient estimators could benefit from variance reduction. In this work, we review existing control-variate-based variance reduction methods for pathwise gradient estimators to assess their effectiveness. Notably, these methods often rely on integrand approximations and are applicable only to simple variational families. To address this limitation, we propose applying zero-variance control variates to pathwise gradient estimators. This approach has the advantage of requiring minimal assumptions about the variational distribution beyond the ability to sample from it.
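To make the estimators named in the abstract concrete, the following is a minimal NumPy sketch on a toy problem. It is not the authors' implementation. It assumes a one-dimensional Gaussian variational family q = N(mu, sigma^2), a hypothetical integrand f(z) = z^2 + sin(z), and the gradient taken with respect to mu only. The control variate is the score of the standard normal base distribution, -eps, which corresponds to a first-order polynomial zero-variance control variate and has mean zero; its coefficient is fitted by least squares on the same samples. All function names are illustrative.

```python
import numpy as np

def pathwise_grad_samples(mu, sigma, eps):
    """Per-sample pathwise gradient w.r.t. mu: d/dmu f(mu + sigma * eps) = f'(z)."""
    z = mu + sigma * eps
    return 2.0 * z + np.cos(z)  # f'(z) for the toy integrand f(z) = z^2 + sin(z)

def score_function_grad_samples(mu, sigma, eps):
    """Per-sample score-function gradient w.r.t. mu: f(z) * d/dmu log q(z)."""
    z = mu + sigma * eps
    f = z**2 + np.sin(z)
    return f * (z - mu) / sigma**2

def first_order_zvcv(g, eps):
    """Subtract a fitted multiple of the base-distribution score (assumed N(0, 1))."""
    c = -eps  # score of N(0, 1) evaluated at eps; zero mean under the base distribution
    beta = np.cov(g, c)[0, 1] / np.var(c, ddof=1)  # least-squares coefficient
    return g - beta * c

rng = np.random.default_rng(0)
mu, sigma = 0.5, 1.0
eps = rng.standard_normal(10_000)

g_path = pathwise_grad_samples(mu, sigma, eps)
g_score = score_function_grad_samples(mu, sigma, eps)
g_cv = first_order_zvcv(g_path, eps)

# Closed form for this toy case: d/dmu E[z^2 + sin z] = 2*mu + cos(mu) * exp(-sigma^2 / 2)
true_grad = 2.0 * mu + np.cos(mu) * np.exp(-sigma**2 / 2.0)

print("true gradient       :", true_grad)
print("score-function  var :", g_score.var(ddof=1))
print("pathwise        var :", g_path.var(ddof=1))
print("pathwise + CV   var :", g_cv.var(ddof=1))
print("pathwise + CV  mean :", g_cv.mean())
```

In this toy setting the per-sample variance should fall from the score-function estimator to the pathwise estimator, and again once the control variate is applied. This shows only that gradient variance can be reduced. It says nothing about the computational overhead or the effect on ELBO convergence, which are the trade-offs the paper examines.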
