Provably Scalable Black-Box Variational Inference with Structured Variational Families

Joohwan Ko, Kyurae Kim, Woo Chang Kim, Jacob R. Gardner

TL;DR

This work investigates the scalability limitations of BBVI when using full-rank variational covariances and demonstrates that structured location-scale variational families can dramatically improve computational efficiency. By focusing on triangular/bordered block-diagonal scale structures and employing proximal SGD, the authors prove $\mathcal{O}(N)$ iteration complexity for finite-sum hierarchical models, and validate these results with large-scale experiments on models with local variables. A key contribution is the formalization of hierarchical branched distributions and the analysis showing how gradient variance depends on an effective dimensionality $d^*$, which can be reduced by structure. The findings offer a principled trade-off between posterior expressiveness and computational tractability, with practical implications for scalable Bayesian inference in complex hierarchical settings.

Abstract

Variational families with full-rank covariance approximations are known not to work well in black-box variational inference (BBVI), both empirically and theoretically. In fact, recent computational complexity results for BBVI have established that full-rank variational families scale poorly with the dimensionality of the problem compared to e.g. mean-field families. This is particularly critical to hierarchical Bayesian models with local variables; their dimensionality increases with the size of the datasets. Consequently, one gets an iteration complexity with an explicit $\mathcal{O}(N^2)$ dependence on the dataset size $N$. In this paper, we explore a theoretical middle ground between mean-field variational families and full-rank families: structured variational families. We rigorously prove that certain scale matrix structures can achieve a better iteration complexity of $\mathcal{O}\left(N\right)$, implying better scaling with respect to $N$. We empirically verify our theoretical results on large-scale hierarchical models.
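To make the "middle ground" concrete, the sketch below builds the sparsity pattern of a lower-triangular, bordered block-diagonal scale matrix and draws a sample through the location-scale reparameterization $z = Cu + m$. It is an illustration under stated assumptions, not the authors' implementation: the ordering (global variables first, then one block per local group), the function names, and the dimensions `d_global`, `d_local`, and `n_groups` are all hypothetical. The count of non-zero entries only shows how the structure grows with the number of groups; it is not the iteration-complexity argument itself.

```python
import random


def structured_mask(d_global, d_local, n_groups):
    """Boolean sparsity pattern of a lower-triangular, bordered
    block-diagonal scale matrix (assumed layout: globals first)."""
    d = d_global + n_groups * d_local
    mask = [[False] * d for _ in range(d)]
    # Dense lower-triangular block for the global variables.
    for i in range(d_global):
        for j in range(i + 1):
            mask[i][j] = True
    for g in range(n_groups):
        start = d_global + g * d_local
        for i in range(start, start + d_local):
            # Border: each local variable may depend on every global variable.
            for j in range(d_global):
                mask[i][j] = True
            # Lower-triangular diagonal block for this group's local variables.
            for j in range(start, i + 1):
                mask[i][j] = True
    return mask


def count_nonzeros(mask):
    """Number of free (non-zero) entries in the scale matrix."""
    return sum(sum(row) for row in mask)


def sample(location, scale, rng):
    """Location-scale reparameterization z = C u + m with u ~ N(0, I).
    Assumes `scale` is lower triangular, so only j <= i contributes."""
    d = len(location)
    u = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return [
        location[i] + sum(scale[i][j] * u[j] for j in range(i + 1))
        for i in range(d)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (10, 20, 40):
        d = 2 + n  # d_global = 2, d_local = 1
        structured = count_nonzeros(structured_mask(2, 1, n))
        full_rank = d * (d + 1) // 2  # dense lower-triangular
        print(n, "mean-field:", d, "structured:", structured, "full-rank:", full_rank)
    mask = structured_mask(2, 1, 3)
    scale = [[0.1 if entry else 0.0 for entry in row] for row in mask]
    print(sample([0.0] * 5, scale, rng))
```

Under this assumed layout, each additional local group adds a fixed number of non-zero entries, so the structured pattern grows linearly in the number of groups while the dense lower-triangular pattern grows quadratically.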

Paper Structure (71 sections, 7 theorems, 141 equations, 13 figures, 2 tables)

This paper contains 71 sections, 7 theorems, 141 equations, 13 figures, 2 tables.

Key Result

Theorem 1

Let $\ell$ be $\mu$-strongly convex and $L$-smooth. Then, the iteration complexity of being $\epsilon$-close to the global minimizer with proximal SGD BBVI is where $\kappa = L / \mu$, $\Delta_0 = {\left\lVert \vlambda_0 - \vlambda^* \right\rVert}_2$ is the distance between the initial point $\vlambda_0$ and the global optimum $\vlambda^* = \mathop{\mathrm{arg\,min}}\limits_{\vlambda \in \Lambda}

Figures (13)

  • Figure 1: Visualization of $\mC$ under the proposed structure. The colored entries are non-zero, while the white entries are filled with zeros.
  • Figure 2: Number of iterations $T$ required to obtain $\epsilon$ accuracy of variational families for a given stepsize $\gamma$. structured behaves similarly to mean-field, while full-rank requires significantly more iterations and also scales worse with respect to the number of datapoints $n$.
  • Figure 3: Scaling of variational families with respect to the number of datapoints $n$. full-rank exhibits worse scaling than structured and mean-field.
  • Figure 4: ELBO at $T = 5 \times 10^4$ versus the optimizer stepsize ($\gamma$) on the considered problems with varying dataset sizes. The solid lines are the median over 8 independent replications, while the colored bands mark the 80% empirical percentiles.
  • Figure 5: ELBO versus stepsize on rpoisson-small. The solid lines are the median, while the shaded regions are the 80% quantiles computed from 4 independent replications. Notice that the performance gap between full-rank and structured becomes narrower as we reduce the stepsize.
  • ...and 8 more figures

Theorems & Definitions (29)

  • Definition 1: Location-Scale Family
  • Theorem 1: domke_provable_2023; kim_convergence_2023
  • Corollary: Informal
  • Remark 1: Sample Complexity
  • Remark 2: Are fewer parameters obviously better?
  • Remark 3: Where does $d$ come from?
  • Remark 4
  • Remark 5
  • Remark 6
  • Corollary 1
  • ...and 19 more