On the Convergence of Black-Box Variational Inference

Kyurae Kim, Jisu Oh, Kaiwen Wu, Yi-An Ma, Jacob R. Gardner

TL;DR

This work delivers the first convergence guarantees for full black-box variational inference (BBVI) as used in practice, covering reparameterization gradients with the location-scale variational family on log-smooth posteriors. It reveals that nonlinear scale parameterizations can destroy strong convexity and slow convergence, while proximal SGD reinstates strong guarantees and achieves the fastest known rates for stochastic first-order methods in this setting. The authors provide a detailed theoretical analysis of ELBO smoothness and convexity under nonlinear parameterizations, propose a generalized gradient-variance framework, and prove convergence results for proximal BBVI with 1-Lipschitz diagonally conditioned scales. They validate the theory empirically, showing proximal BBVI outperforms standard BBVI and nonlinear parameterizations on both synthetic and large-scale realistic problems. The findings highlight practical guidance for BBVI design and establish a rigorous foundation for convergence in probabilistic programming contexts.

Abstract

We provide the first convergence guarantee for full black-box variational inference (BBVI), also known as Monte Carlo variational inference. While preliminary investigations worked on simplified versions of BBVI (e.g., bounded domain, bounded support, only optimizing for the scale, and such), our setup does not need any such algorithmic modifications. Our results hold for log-smooth posterior densities with and without strong log-concavity and the location-scale variational family. Also, our analysis reveals that certain algorithm design choices commonly employed in practice, particularly, nonlinear parameterizations of the scale of the variational approximation, can result in suboptimal convergence rates. Fortunately, running BBVI with proximal stochastic gradient descent fixes these limitations, and thus achieves the strongest known convergence rate guarantees. We evaluate this theoretical insight by comparing proximal SGD against other standard implementations of BBVI on large-scale Bayesian inference problems.
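
The proximal variant mentioned in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a mean-field Gaussian family with a linear scale parameterization, a diagonal Gaussian target, and a fixed stepsize. The names `prox_neg_log`, `proximal_bbvi`, `grad_ell`, `mu`, and `sigma` are hypothetical. The gradient step is taken on the energy term only, and the entropy term $-\sum_i \log s_i$ is handled by its closed-form proximal operator, which also keeps the scale positive.

```python
import numpy as np

def prox_neg_log(s, step):
    # Proximal operator of h(x) = -log(x) with stepsize `step`:
    # argmin_x { -log(x) + (x - s)^2 / (2 * step) }, i.e. the positive
    # root of x^2 - s * x - step = 0. The output is positive for any s.
    return 0.5 * (s + np.sqrt(s * s + 4.0 * step))

def proximal_bbvi(grad_ell, dim, step=0.01, n_iters=5000, n_samples=10, seed=0):
    # grad_ell(z) returns the gradient of the negative log target at each
    # row of z (shape: n_samples x dim). Hypothetical interface.
    rng = np.random.default_rng(seed)
    m = np.zeros(dim)  # location
    s = np.ones(dim)   # scale (mean-field, linear parameterization)
    for _ in range(n_iters):
        u = rng.standard_normal((n_samples, dim))
        z = m + s * u                    # reparameterization z = m + s * u
        g = grad_ell(z)
        grad_m = g.mean(axis=0)          # energy gradient w.r.t. location
        grad_s = (g * u).mean(axis=0)    # energy gradient w.r.t. scale
        m = m - step * grad_m
        # Gradient step on the energy, then proximal step on the entropy term.
        s = prox_neg_log(s - step * grad_s, step)
    return m, s

# Toy target (assumption): independent Gaussian with means mu and std devs sigma.
mu = np.array([1.0, -2.0])
sigma = np.array([0.5, 2.0])
grad_ell = lambda z: (z - mu) / sigma**2

m_hat, s_hat = proximal_bbvi(grad_ell, dim=2)
print("location:", m_hat)
print("scale:   ", s_hat)
```

With a fixed stepsize the iterates settle in a noisy neighborhood of the optimum (`mu`, `sigma`) rather than converging exactly; a decreasing stepsize schedule would be needed for the latter.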

Paper Structure

This paper contains 68 sections, 2 theorems, 146 equations, 7 figures, and 2 tables.

Key Result

Corollary 1

Let $\ell$ be $L_{\ell}$-smooth and let the smoothness assumption (source label: assumption:peculiar_smoothness) hold. Furthermore, let the diagonal conditioner be 1-Lipschitz continuous and $L_{\phi}$-log-smooth. Then the ELBO is $(L_{\ell} + L_s + L_{\phi})$-smooth.
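
One way to read the constant in Corollary 1 is as a sum over the two parts of the objective. The sketch below assumes the mean-field case and the usual split of the negative ELBO $F$ into an energy term $f$ and a negative entropy term $h$. The symbols $F$, $f$, $h$, $\boldsymbol{u}$, $\boldsymbol{m}$, $\boldsymbol{s}$ and the attribution of $L_{\ell} + L_s$ to the energy are assumptions for illustration, not taken from the statement above.

```latex
\begin{align*}
F(\boldsymbol{\lambda})
  &= \underbrace{\mathbb{E}_{\boldsymbol{u}}\!\left[
       \ell\big(\boldsymbol{m} + \phi(\boldsymbol{s}) \odot \boldsymbol{u}\big)
     \right]}_{f(\boldsymbol{\lambda})}
   \;+\;
     \underbrace{\Big(-\sum_{i} \log \phi(s_i)\Big)}_{h(\boldsymbol{\lambda})}
   \;+\; \text{const},
   \qquad \boldsymbol{\lambda} = (\boldsymbol{m}, \boldsymbol{s}), \\[4pt]
\big\|\nabla F(\boldsymbol{\lambda}) - \nabla F(\boldsymbol{\lambda}')\big\|
  &\le \big\|\nabla f(\boldsymbol{\lambda}) - \nabla f(\boldsymbol{\lambda}')\big\|
     + \big\|\nabla h(\boldsymbol{\lambda}) - \nabla h(\boldsymbol{\lambda}')\big\| \\
  &\le (L_{\ell} + L_s)\,\|\boldsymbol{\lambda} - \boldsymbol{\lambda}'\|
     + L_{\phi}\,\|\boldsymbol{\lambda} - \boldsymbol{\lambda}'\|
   = (L_{\ell} + L_s + L_{\phi})\,\|\boldsymbol{\lambda} - \boldsymbol{\lambda}'\|.
\end{align*}
```

Under this reading, $L_{\phi}$-log-smoothness of the conditioner bounds the curvature of the separable term $h$, while the remaining constants come from the energy.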

Figures (7)

  • Figure 1: Taxonomy of variational inference. Within BBVI, this work only considers the reparameterization gradient ($\mathrm{BBVI} \,\cap\, \mathrm{RP}$, shown in dark red). This leaves out BBVI with the score gradient ($\mathrm{BBVI} \setminus \mathrm{RP}$, shown in light red). The set $\mathrm{VI} \,\cap\, \mathrm{FS}$ includes sparse variational Gaussian processes (Titsias, 2009), while the remaining set $\mathrm{VI} \setminus \left( \mathrm{FS} \cup \mathrm{IS} \cup \mathrm{RP} \right)$ includes coordinate ascent VI (Blei et al., 2017).
  • Figure 2: Optimization landscape resulting from different $\phi$ on a strongly convex $\ell$. Here $\ell$ is the counter-example from item 2 of the result labeled thm:gradient_covariance_sign. $\phi(x) = x$ preserves strong convexity, as shown by the lower-bounding quadratic (red dotted line). $\phi = \operatorname{softplus}$ violates the first-order condition of convexity (black dotted line).
  • Figure 3: Stepsize versus the number of iterations for vanilla SGD and proximal SGD to achieve $\mathrm{D}_{\mathrm{KL}}(q_{\boldsymbol{\lambda}},\pi) \leq \epsilon = 1$ under different initializations for Gaussian posteriors. The initializations $C\left(\boldsymbol{\lambda}_0\right)$ are $\mathbf{I}$, $10^{-3}\mathbf{I}$, $10^{-5}\mathbf{I}$ from left to right, respectively. The average suboptimality at iteration $t$ was estimated from 10 independent runs. For each run, the target posterior was a 10-dimensional Gaussian with a covariance of condition number $\kappa = 10$ and a smoothness of $L = 100$.
  • Figure 4: Comparison of BBVI convergence speed (ELBO vs. iteration) for different optimization algorithms. The error bands are the 80% quantiles estimated from 20 (10 for AR-eeg) independent replications. The results shown use a base stepsize of $\gamma = 10^{-3}$ and the initial point $\boldsymbol{m}_0 = \mathbf{0}$, $\boldsymbol{C}_0 = \mathbf{I}$. Details on the setup can be found in the section on realistic problems (source label: section:realistic_problems) and the experimental setup appendix (source label: appendix:experimental_setup).
  • Figure 5: ProxGen-Adam for Black-Box Variational Inference
  • ...and 2 more figures
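
The effect described in Figure 2 can be probed numerically in one dimension. The sketch below is an illustration under simplifying assumptions, not the counter-example used in the paper: it takes a standard Gaussian target, $\ell(z) = z^2/2$, so that the scale-dependent part of the negative ELBO is $\phi(s)^2/2 - \log \phi(s)$, and estimates its curvature by central finite differences. The helper names `neg_elbo_scale` and `curvature` are hypothetical. On this simple target the softplus objective stays convex, but its curvature decays toward zero for very negative $s$, so no strong-convexity constant survives, whereas the linear parameterization keeps curvature at least 1.

```python
import math

def softplus(s):
    return math.log1p(math.exp(s))

def identity(s):
    return s

def neg_elbo_scale(s, phi):
    # Scale-dependent part of the negative ELBO for the assumed target
    # ell(z) = z^2 / 2 and q = N(m, phi(s)^2):
    # energy phi(s)^2 / 2 minus entropy log(phi(s)), constants dropped.
    c = phi(s)
    return 0.5 * c * c - math.log(c)

def curvature(s, phi, h=1e-2):
    # Central finite-difference estimate of the second derivative in s.
    f = lambda x: neg_elbo_scale(x, phi)
    return (f(s + h) - 2.0 * f(s) + f(s - h)) / (h * h)

print("linear phi:")
for s in (0.5, 1.0, 2.0):
    print(f"  s = {s:5.1f}  curvature = {curvature(s, identity):.4f}")

print("softplus phi:")
for s in (-10.0, -5.0, 0.0, 5.0):
    print(f"  s = {s:5.1f}  curvature = {curvature(s, softplus):.6f}")
```

For the linear case the exact curvature is $1 + 1/s^2$, which is bounded below by 1 on the whole domain $s > 0$.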

Theorems & Definitions (25)

  • Definition 1: Reparameterized Family
  • Definition 2: Location-Scale Reparameterization Function
  • Definition 3: Mean-Field Family
  • Definition 4: Full-Rank Cholesky Family
  • Example 1
  • Remark 1
  • Remark 2
  • Corollary 1: Smoothness of the ELBO
  • Remark 3
  • Lemma 1 (Domke, 2019; Kim et al., 2023)
  • ...and 15 more