Pathwise Gradient Variance Reduction with Control Variates in Variational Inference

Kenyon Ng, Susan Wei

TL;DR

This paper analyzes variance reduction for pathwise gradient estimators in variational inference (VI). It surveys existing control variate (CV) approaches and introduces zero-variance control variates (ZVCV) to relax assumptions on the variational distribution. While ZVCV can be used with complex reparameterizable models whose mean and covariance are intractable, empirical results show that CV-based variance reduction rarely justifies its computational overhead in standard VI tasks; simply increasing the number of gradient samples often yields faster convergence and comparable ELBO improvements. The study highlights that gradient-variance reduction alone may not translate to better downstream metrics, motivating future exploration of ZVCV in generative or energy-based/implicit VI settings. Overall, the work clarifies when CV methods help and when they do not, and provides practical guidance on the trade-offs involved in variance reduction for VI.

Abstract

Variational inference in Bayesian deep learning often involves computing the gradient of an expectation that lacks a closed-form solution. In these cases, pathwise and score-function gradient estimators are the most common approaches. The pathwise estimator is often favoured for its substantially lower variance compared to the score-function estimator, which typically requires variance reduction techniques. However, recent research suggests that even pathwise gradient estimators could benefit from variance reduction. In this work, we review existing control-variates-based variance reduction methods for pathwise gradient estimators to assess their effectiveness. Notably, these methods often rely on integrand approximations and are applicable only to simple variational families. To address this limitation, we propose applying zero-variance control variates to pathwise gradient estimators. This approach offers the advantage of requiring minimal assumptions about the variational distribution, other than being able to sample from it.
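To make the idea of a control variate on a pathwise gradient concrete, the following is a minimal toy sketch and not the paper's implementation. It assumes a one-dimensional Gaussian $q = \mathcal{N}(\mu, \sigma^2)$, a hypothetical integrand $f(z) = z^4$, and uses the base noise $\epsilon$ itself (which has known mean zero) as the control variate, with a least-squares coefficient estimated from the same samples. The function names and all numerical values are illustrative assumptions.

```python
import numpy as np


def pathwise_grad_samples(mu, sigma, eps):
    """Per-sample pathwise gradient of f(z) = z**4 with respect to mu.

    Assumption: toy integrand f(z) = z**4 and reparameterization z = mu + sigma * eps.
    """
    z = mu + sigma * eps
    return 4.0 * z**3  # df/dz * dz/dmu, with dz/dmu = 1


def cv_adjusted_samples(g, c):
    """Subtract a zero-mean control variate c from the gradient samples g.

    The coefficient beta = Cov(g, c) / E[c^2] is estimated by least squares
    from the same samples, which introduces a small bias for finite samples.
    """
    beta = np.dot(g - g.mean(), c) / np.dot(c, c)
    return g - beta * c


rng = np.random.default_rng(0)
eps = rng.standard_normal(10_000)  # base noise, E[eps] = 0

g = pathwise_grad_samples(mu=1.0, sigma=0.5, eps=eps)  # plain estimator samples
h = cv_adjusted_samples(g, c=eps)                      # control-variate-adjusted samples

# Ratio of sample variances; values below 1 mean the control variate helped.
variance_ratio = h.var() / g.var()
print(f"mean without CV: {g.mean():.3f}, mean with CV: {h.mean():.3f}")
print(f"variance ratio V[h] / V[g]: {variance_ratio:.3f}")
```

In this toy problem the gradient is strongly correlated with $\epsilon$, so the ratio falls well below 1. The paper's point is that such reductions do not automatically translate into faster optimization once the extra cost of the control variate is counted.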

Paper Structure (35 sections, 18 equations, 14 figures, 2 algorithms)

This paper contains 35 sections, 18 equations, 14 figures, 2 algorithms.

Figures (14)

  • Figure 1: ELBO is plotted against wall-clock time for different numbers of gradient samples $L$ and two families of $q$. The bold lines represent the median of ELBO values recorded at the same iteration across five repetitions. The shaded area illustrates the range of ELBO values across five repetitions. The ELBO values are smoothed using an exponential moving average. A higher ELBO indicates better performance. See Figure \ref{fig:mean-elbo-time} for plots where the bold lines represent the mean ELBO.
  • Figure 2: ELBO is plotted against the number of gradient descent steps for different numbers of gradient samples $L$ and two families of $q$. The bold lines represent the median of ELBO values recorded at the same iteration across five repetitions. The shaded area illustrates the range of ELBO values across five repetitions. The ELBO values are smoothed using an exponential moving average. The trajectories of ZVCV-GD and NoCV are nearly identical in both full-batch and mini-batch BNN when $L=10$. A higher ELBO indicates better performance. See Figure \ref{fig:mean-elbo-iter} for plots where the bold lines represent the mean ELBO.
  • Figure 3: We present the variance ratio $\mathbb{V}[\hat{h}] / \mathbb{V}[\hat{g}]$, where $\hat{g}$ is NoCV and $\hat{h}$ is either ZVCV-GD or QuadCV, at each iteration. We show only the median variance ratios recorded at the same iteration across five repetitions, omitting the individual variance ratios from each repetition to prevent clutter in the plots. The ratios from mean-field Gaussian and real NVP are shown in top and bottom rows respectively. Note that NoCV (in red) is always 1 by definition. We see that ZVCV-GD (in blue) struggles to reduce variance in the BNN models. There is also a significant overlap in QuadCV between $L = 10$ (solid green) and $L = 50$ (dotted green). A lower ratio indicates better performance. See Figure \ref{fig:mean-vr} for plots where the bold lines represent the mean variance ratios.
  • Figure 4: ELBO is plotted against gradient descent steps and wall-clock time for varying numbers of gradient samples $L$ using rank-5 Gaussian. The bold lines represent the median of ELBO values recorded at the same iteration across five repetitions. The ELBO values have been smoothed using an exponential moving average. A higher ELBO indicates better performance. See Figure \ref{fig:mean-elbo-diaglr} for plots where the bold lines represent the mean ELBO.
  • Figure 5: We present the variance ratio $\mathbb{V}[\hat{h}] / \mathbb{V}[\hat{g}]$ of rank-5 Gaussian, where $\hat{g}$ is NoCV and $\hat{h}$ is either ZVCV-GD or QuadCV, at each iteration. We show only the median variance ratios recorded at the same iteration across five repetitions, omitting the individual variance ratios from each repetition to prevent clutter in the plots. Note that NoCV (in red) is always 1 by definition. We see that ZVCV-GD (in blue) struggles to reduce variance in the BNN models. There is also some overlap between $L = 10$ (solid green) and $L = 50$ (dotted green). A lower ratio indicates better performance. See Figure \ref{fig:mean-vr-diaglr} for plots where the bold lines represent the mean variance ratios.
  • ...and 9 more figures
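
The captions above describe curves built from five repetitions: an exponential moving average for smoothing, a median line, and a shaded min-to-max range. The sketch below shows one way such summaries could be computed. It is an illustration under assumptions, not the authors' plotting code: the smoothing factor `alpha`, the choice to smooth each run before taking the median, and the synthetic ELBO traces are all hypothetical.

```python
import numpy as np


def ema(values, alpha=0.1):
    """Exponential moving average; alpha is an assumed smoothing factor."""
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    out[0] = values[0]
    for t in range(1, len(values)):
        out[t] = alpha * values[t] + (1.0 - alpha) * out[t - 1]
    return out


def summarize_runs(runs, alpha=0.1):
    """Smooth each repetition, then take per-iteration median, min, and max.

    Assumption: smoothing is applied per run before aggregating.
    """
    smoothed = np.stack([ema(run, alpha) for run in runs])
    return np.median(smoothed, axis=0), smoothed.min(axis=0), smoothed.max(axis=0)


# Synthetic stand-in for five ELBO traces (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
steps = np.arange(200)
runs = [
    -100.0 * np.exp(-steps / 50.0) + rng.normal(scale=2.0, size=steps.size)
    for _ in range(5)
]

median, lower, upper = summarize_runs(runs)
print(median[:3], lower[:3], upper[:3])
```

The median is less sensitive than the mean to a single diverging repetition, which may be why the captions point to separate figures for the mean-based versions.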