Variance Control for Black Box Variational Inference Using The James-Stein Estimator

Dominic B. Dayta

TL;DR

This paper addresses instability and tuning challenges in Black Box Variational Inference (BBVI) arising from high-variance ELBO gradient estimates. It reframes BBVI updates as a multivariate estimation problem and introduces the Positive-Part James-Stein shrinkage to the gradient estimator (BBVI-JS+), achieving variance control without requiring explicit factorization of the variational family. Theoretical results show JS+ can dominate the naive gradient in mean-squared error, while practical experiments on Gaussian mixtures and benchmark datasets demonstrate stable convergence and competitive model fit relative to Rao-Blackwellized BBVI, with robust performance across varying model sizes. The approach offers a simple, black-box-compatible variance reduction that can be integrated with RMSProp and applied to large-scale Bayesian models, potentially broadening BBVI's applicability and reliability.

Abstract

Black Box Variational Inference is a promising framework in a succession of recent efforts to make Variational Inference more "black box". However, in its basic version it either fails to converge due to instability or requires fine-tuning of the update steps prior to execution, which hinders it from being completely general purpose. We propose a method for regulating its parameter updates by reframing stochastic gradient ascent as a multivariate estimation problem. We examine the properties of the James-Stein estimator as a replacement for the arithmetic mean of Monte Carlo estimates of the gradient of the evidence lower bound. The proposed method provides relatively weaker variance reduction than Rao-Blackwellization, but offers the tradeoff of being simpler and requiring no fine-tuning on the part of the analyst. Results on benchmark datasets also show performance consistently on par with or better than the Rao-Blackwellized approach in terms of model fit and time to convergence.
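To make the idea of replacing the arithmetic mean with a James-Stein estimate concrete, the following is a minimal sketch of the textbook positive-part James-Stein estimator that shrinks a sample mean toward zero. It is not taken from the paper: the function names, the Gaussian noise model with known variance `sigma2`, and the simulation settings are all assumptions chosen for illustration, and the paper's exact shrinkage constant may differ.

```python
import random

def mean_vector(samples):
    # Arithmetic mean of a list of equal-length vectors.
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]

def js_plus(samples, sigma2):
    """Textbook positive-part James-Stein shrinkage of the sample mean toward zero.

    Assumption (not from the paper): each sample is N(mu, sigma2 * I) with
    sigma2 known, and the dimension p is at least 3.
    """
    n = len(samples)
    zbar = mean_vector(samples)
    p = len(zbar)
    sq_norm = sum(v * v for v in zbar)
    if sq_norm == 0.0:
        return zbar
    # The sample mean has per-coordinate variance sigma2 / n.
    shrink = max(0.0, 1.0 - (p - 2) * (sigma2 / n) / sq_norm)
    return [shrink * v for v in zbar]

def sq_error(estimate, mu):
    return sum((e - m) ** 2 for e, m in zip(estimate, mu))

# Hypothetical simulation: compare mean squared error over repeated draws.
rng = random.Random(0)
mu = [0.5] * 10            # stand-in for a true gradient vector
sigma2, n, trials = 4.0, 5, 2000
mse_mean = 0.0
mse_js = 0.0
for _ in range(trials):
    samples = [[rng.gauss(m, sigma2 ** 0.5) for m in mu] for _ in range(n)]
    mse_mean += sq_error(mean_vector(samples), mu) / trials
    mse_js += sq_error(js_plus(samples, sigma2), mu) / trials
```

In this toy setting `mse_js` comes out below `mse_mean`, which is the behaviour the James-Stein result predicts for dimension three or higher. Real ELBO gradient noise is neither exactly Gaussian nor of known variance, so this should be read as an illustration of the shrinkage mechanism only.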

Paper Structure

This paper contains 13 sections, 6 theorems, 31 equations, 3 figures, 1 table, 3 algorithms.

Key Result

Theorem 1

BBVI-Naive, which we now denote as $\hat{\mu}_{MLE}$, is the Maximum Likelihood estimator of $\mu = \nabla_\lambda \mathcal{L}$, computed from the per-sample gradient terms $z_s = \nabla_{\lambda} \log q(\theta[s] \mid \lambda) \, (\log p(y, \theta[s]) - \log q(\theta[s] \mid \lambda))$. Furthermore, $\hat{\mu}_{MLE}$ is an unbiased estimator of the true gradient $\mu$.
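As an illustration of the per-sample terms $z_s$ and their arithmetic mean, here is a minimal sketch on a toy model. The model (a standard normal prior on $\theta$ with a unit-variance Gaussian likelihood), the unit-variance Gaussian variational family with mean $\lambda$, and all function names are assumptions made for this example, not the paper's setup. This toy model has the closed-form ELBO gradient $y - 2\lambda$, which gives something to compare against.

```python
import math
import random

def log_joint(y, theta):
    # Hypothetical toy model: theta ~ N(0, 1), y | theta ~ N(theta, 1).
    return -0.5 * theta ** 2 - 0.5 * (y - theta) ** 2 - math.log(2 * math.pi)

def log_q(theta, lam):
    # Assumed variational family: q(theta | lam) = N(lam, 1).
    return -0.5 * (theta - lam) ** 2 - 0.5 * math.log(2 * math.pi)

def score(theta, lam):
    # Gradient of log q(theta | lam) with respect to lam.
    return theta - lam

def naive_bbvi_gradient(lam, y, num_samples, rng):
    # Arithmetic mean of z_s = score * (log p(y, theta_s) - log q(theta_s | lam)).
    total = 0.0
    for _ in range(num_samples):
        theta = rng.gauss(lam, 1.0)
        total += score(theta, lam) * (log_joint(y, theta) - log_q(theta, lam))
    return total / num_samples

rng = random.Random(0)
estimate = naive_bbvi_gradient(lam=0.0, y=2.0, num_samples=20000, rng=rng)
exact = 2.0 - 2 * 0.0  # closed-form gradient y - 2 * lam for this toy model
```

With many samples the estimate lands close to `exact`, consistent with unbiasedness, but individual $z_s$ values scatter widely around it. That scatter is the variance problem that shrinkage and Rao-Blackwellization aim to control.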

Figures (3)

  • Figure 1: Relationship between gradient clipping and the James-Stein estimator. Gradient clipping $G_c$ preserves values of the gradient only up to $|| f ||^2 \leq c$. The Positive Part James-Stein operator $JS+$ penalizes $||f||^2$ for being close to $c$ and forces it towards zero.
  • Figure 2: Resulting estimator variances from the Gaussian Mixture experiment with $K = 2$ to $10$ components. BBVI-JS+ produces controlled variances in its sampling distribution for the ELBO gradient compared to BBVI-Naive, but its variance, relative to BBVI-RB, still grows with the number of parameters. Interestingly, BBVI-RB+ provides even stricter variance control than BBVI-RB.
  • Figure 3: Scatterplots of benchmark datasets from the FCPS package. From left to right: EngyTime, Lsun3D, and Tetra.
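
The contrast drawn in the Figure 1 caption can be sketched as two scaling factors applied to a gradient $f$ as a function of $\|f\|^2$. This is an illustrative reading, not the paper's definition: it assumes standard norm clipping (rescale to the threshold once $\|f\|^2 > c$) and a positive-part James-Stein factor of the form $\max(0, 1 - c / \|f\|^2)$, with hypothetical function names.

```python
def clip_factor(sq_norm, c):
    # Norm clipping: keep f unchanged while ||f||^2 <= c,
    # otherwise rescale it so that ||f||^2 equals c.
    return 1.0 if sq_norm <= c else (c / sq_norm) ** 0.5

def js_plus_factor(sq_norm, c):
    # Positive-part James-Stein style factor: zero out f when ||f||^2 <= c,
    # and shrink less and less as ||f||^2 grows beyond c.
    if sq_norm <= 0.0:
        return 0.0
    return max(0.0, 1.0 - c / sq_norm)

c = 4.0
# Each row: (squared norm, clipping factor, JS+ factor).
rows = [(s, clip_factor(s, c), js_plus_factor(s, c)) for s in (1.0, 4.0, 8.0, 100.0)]
```

Under these assumptions the two operators act in opposite regimes: clipping leaves small gradients alone and damps large ones, while the JS+ factor suppresses gradients whose squared norm is near or below $c$ and leaves large ones almost untouched.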

Theorems & Definitions (7)

  • Definition 1
  • Theorem 1: BBVI as MLE Estimator
  • Theorem 2: Positive Part James-Stein Estimator
  • Theorem 3: Variance Reduction of the James-Stein Estimator
  • Theorem 4
  • Corollary 1
  • Theorem 5