Amortized Variational Inference: When and Why?

Charles C. Margossian; David M. Blei

TL;DR

This paper analyzes amortized variational inference (A-VI) as a general-purpose alternative to factorized (mean-field) variational inference (F-VI). It derives necessary, sufficient, and verifiable conditions under which A-VI attains the same optimal solution as F-VI. An ideal inference function exists for simple hierarchical models, and expanding the inference function's input domain extends the approach to more complex structures such as time series. For some models (e.g., simple hierarchical models and the saw time series), A-VI closes the amortization gap with a relatively compact inference function; for others (e.g., hidden Markov models), the gap cannot be closed even with an expanded domain. Experiments on linear and nonlinear probabilistic models, a Bayesian neural network, and a time series illustrate when A-VI matches F-VI and when it converges faster, offering practical guidance on when to use A-VI and how to design the inference function. The findings support A-VI as a viable method for full Bayesian inference in a broad class of models, and they highlight edge cases and diagnostics relevant to model and algorithm selection.

Abstract

In a probabilistic latent variable model, factorized (or mean-field) variational inference (F-VI) fits a separate parametric distribution for each latent variable. Amortized variational inference (A-VI) instead learns a common inference function, which maps each observation to its corresponding latent variable's approximate posterior. Typically, A-VI is used as a step in the training of variational autoencoders; however, it stands to reason that A-VI could also be used as a general alternative to F-VI. In this paper, we study when and why A-VI can be used for approximate Bayesian inference. We derive conditions on a latent variable model that are necessary, sufficient, and verifiable, and under which A-VI can attain F-VI's optimal solution, thereby closing the amortization gap. We prove that these conditions are uniquely verified by simple hierarchical models, a broad class that encompasses many models in machine learning. We then show, on a broader class of models, how to expand the domain of A-VI's inference function to improve its solution, and we provide examples, e.g., hidden Markov models, where the amortization gap cannot be closed.
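To make the contrast between the two parameterizations concrete, here is a minimal sketch on a toy conjugate model. The model is an assumption chosen for illustration and is not taken from the paper's experiments: it has no global parameter, with $z_n \sim \mathcal{N}(0, 1)$ and $x_n \mid z_n \sim \mathcal{N}(z_n, \sigma^2)$. F-VI stores one mean and variance per latent variable, while A-VI shares a hypothetical linear inference function $f(x) = (a x + b, v)$ across all observations. A constant-factor baseline, which assigns the same distribution to every $q(z_n)$, is included for comparison. In this toy setting all three optima are available in closed form.

```python
import numpy as np

def elbo(x, mean, var, sigma2):
    """Closed-form ELBO for the toy model z_n ~ N(0, 1), x_n | z_n ~ N(z_n, sigma2),
    with Gaussian factors q(z_n) = N(mean[n], var[n])."""
    log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * (mean**2 + var)
    log_lik = -0.5 * np.log(2 * np.pi * sigma2) - 0.5 * ((x - mean) ** 2 + var) / sigma2
    entropy = 0.5 * np.log(2 * np.pi * np.e * var)
    return float(np.sum(log_prior + log_lik + entropy))

# Simulated data (assumed values, for illustration only).
rng = np.random.default_rng(0)
N, sigma2 = 200, 0.5
z_true = rng.normal(size=N)
x = z_true + np.sqrt(sigma2) * rng.normal(size=N)

# The ELBO-optimal variance is the same for every factor in this model.
opt_var = np.full(N, sigma2 / (1.0 + sigma2))

# F-VI: N separate means, each maximizing its own ELBO term.
fvi_mean = x / (1.0 + sigma2)
elbo_fvi = elbo(x, fvi_mean, opt_var, sigma2)

# A-VI: shared linear inference function, mean_n = a * x_n + b.
# The ELBO is quadratic in (a, b), so its maximizer solves a stacked
# least-squares problem: minimize ||m||^2 + ||x - m||^2 / sigma2 with m = a*x + b.
design = np.column_stack([x, np.ones(N)])
lhs = np.vstack([design, design / np.sqrt(sigma2)])
rhs = np.concatenate([np.zeros(N), x / np.sqrt(sigma2)])
(a, b), *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
avi_mean = a * x + b
elbo_avi = elbo(x, avi_mean, opt_var, sigma2)

# Constant-factor baseline: one mean shared by every q(z_n).
const_mean = np.full(N, x.mean() / (1.0 + sigma2))
elbo_const = elbo(x, const_mean, opt_var, sigma2)

print(f"F-VI ELBO:     {elbo_fvi:.4f}  ({N} means)")
print(f"A-VI ELBO:     {elbo_avi:.4f}  (2 coefficients: a={a:.4f}, b={b:.2e})")
print(f"Constant ELBO: {elbo_const:.4f}")
```

In this toy case the linear map recovers the per-datapoint optimum, so the amortization gap is zero, while the constant baseline falls short. Richer models would generally need a more expressive inference function and iterative optimization rather than a closed form.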

Paper Structure (22 sections, 7 theorems, 48 equations, 9 figures)

INTRODUCTION
Plan.
Related work.
PRELIMINARIES
EXISTENCE OF AN IDEAL INFERENCE FUNCTION
EXAMPLE: LINEAR PROBABILISTIC MODEL
Further factorizations of $p(\theta, \mathbf{z}, \mathbf{x})$
NUMERICAL EXPERIMENTS
Experimental setup
Linear probabilistic model
Nonlinear probabilistic model
Bayesian neural network
Saw time series
DISCUSSION
Acknowledgment
...and 7 more sections

Key Result

Proposition 2.1

For any class of inference functions $\mathcal{F}$, $\mathcal{Q}_A (\mathcal{F})$ is a subset of $\mathcal{Q}_F$, that is, $\mathcal{Q}_A (\mathcal{F}) \subseteq \mathcal{Q}_F$.
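One way to see why the inclusion holds is to write out the two families. The notation below is a sketch under assumptions: the factor $q_0(\theta)$ for the global parameter and the symbols $\nu_n$ and $f$ are chosen to match the figure captions on this page, and may differ from the paper's exact definitions.

```latex
\mathcal{Q}_F = \Big\{ q : q(\theta, \mathbf{z}) = q_0(\theta) \prod_{n=1}^{N} q(z_n \,;\, \nu_n) \Big\},
\qquad
\mathcal{Q}_A(\mathcal{F}) = \Big\{ q : q(\theta, \mathbf{z}) = q_0(\theta) \prod_{n=1}^{N} q(z_n \,;\, f(x_n)), \ f \in \mathcal{F} \Big\}.

% Inclusion: for any f in F, choosing \nu_n = f(x_n) for every n
% gives an element of Q_F, so each q in Q_A(F) also belongs to Q_F.
```

Under these definitions, F-VI can always match A-VI's solution, and the question studied in the paper is when the converse holds at the optimum.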

Figures (9)

Figure 1: The variational family $\mathcal{Q}_\text{A}$ for A-VI is a subset of the variational family $\mathcal{Q}_\text{F}$ for F-VI. (a) In general, F-VI can achieve a lower KL-divergence than A-VI. (b) Under certain conditions, however, A-VI may still achieve the same optimal solution $q^*$ as F-VI.
Figure 2: For the simple hierarchical model (\ref{eq:simple-hier}), an ideal inference function $f_{\bf x}$ such that $f_{\bf x}(x_n) = q(z_n \,;\, \nu_n^*)$ exists. The saw time series requires learning a map with two inputs $(x_{n -1}, x_n)$. For the hidden Markov and dense hierarchical graphs, there is no ideal inference function. In the dense hierarchical model, there is an edge between every element of $\mathbf{z}$ and every element of $\mathbf{x}$. For clarity, we removed edges between $\theta$ and $z_n$ in all graphs.
Figure 3: Examples of optimization paths. As benchmarks, we use F-VI and a constant factor algorithm which assigns the same distribution to all $q(z_n)$. A-VI is then run using different classes of inference functions: (left) we vary the degree $d$ of a learning polynomial; (middle, right) we vary the width $k$ of an inference neural network. For a sufficiently complex inference function, we find that A-VI attains the same ELBO as F-VI, meaning the amortization gap is closed. For results across multiple seeds, see Figure \ref{fig:iter_to_convergence}.
Figure 4: Wall time to convergence. We run each experiment 10 times and summarize the wall time required for the ELBO to converge for each VI algorithm. For the Bayesian neural network, we report convergence in terms of MSE for the image reconstruction. Algorithms with a collapsed box plot on the right do not close the amortization gap.
Figure 5: Image reconstruction error, as measured by MSE over pixels, for a trained Bayesian neural network. The MSE is not in one-to-one correspondence with the ELBO. For a sufficiently expressive inference network, A-VI achieves the same error as F-VI and converges faster. The plot shows the paths for a single seed; for results across several seeds, see Figure \ref{fig:iter_to_convergence}.
...and 4 more figures

Theorems & Definitions (22)

Proposition 2.1
proof
Definition 2.2
Lemma 3.1
proof
Definition 3.2
Definition 3.3
Theorem 3.4
proof
Remark 3.5
...and 12 more
