Disentangling impact of capacity, objective, batchsize, estimators, and step-size on flow VI

Abhinav Agrawal, Justin Domke

TL;DR

This work systematically disentangles the factors affecting flow VI performance by evaluating capacity, objective, gradient estimators, batchsize, and step-size on a high-fidelity synthetic benchmark with exact samples. It introduces a scalable evaluation metric (marginal-Wasserstein) and demonstrates that high-capacity Real-NVP flows combined with large gradient batchsizes let flow VI approach the accuracy of exact inference and match or surpass turnkey HMC methods under realistic parallel budgets. The authors derive a practical recipe: use high-capacity flows, the standard VI objective, reduced-variance gradient estimators when possible, and a fixed, small step-size, and train for many iterations with an adaptive optimizer. The findings give practitioners concrete guidelines and show that, with sufficient parallelism, flow VI is a competitive alternative to HMC for challenging targets.

Abstract

Normalizing flow-based variational inference (flow VI) is a promising approximate inference approach, but its performance remains inconsistent across studies. Numerous algorithmic choices influence flow VI's performance. We conduct a step-by-step analysis to disentangle the impact of some of the key factors: capacity, objectives, gradient estimators, number of gradient estimates (batchsize), and step-sizes. Each step examines one factor while neutralizing others using insights from the previous steps and/or using extensive parallel computation. To facilitate high-fidelity evaluation, we curate a benchmark of synthetic targets that represent common posterior pathologies and allow for exact sampling. We provide specific recommendations for different factors and propose a flow VI recipe that matches or surpasses leading turnkey Hamiltonian Monte Carlo (HMC) methods.

Paper Structure

This paper contains 34 sections, 24 equations, 12 figures.

Figures (12)

  • Figure 1: Marginal-Wasserstein metric (\ref{eq: wass}) against sequential evaluations for Neal's funnel \cite{neal2001annealed} in ten dimensions, with parallel budget increasing from left to right. For flow VI, sequential evaluations count optimization iterations and parallel evaluations represent batchsize. For HMC, sequential evaluations count leapfrog steps and parallel evaluations represent the number of chains. (We use implementations from NumPyro and TensorFlow Probability, denoting these with $\pi\rho$ (read "pyro") and TFP, respectively.) The black dotted line indicates the marginal-Wasserstein metric under exact samples (\ref{sec:eval}). Flow VI is almost as accurate as exact inference and faster than HMC when using sufficient parallel budget.
  • Figure 2: Rows: Pair marginals. These targets cover various pathologies: Ill-conditioned Gaussian has high correlations, Banana has non-linear relationships, Neal's funnel has parameters whose spread depends on other parameters, Funana combines funnel-like behavior with non-linearity, and Student-t with $\nu = 1.5$ has heavier tails than Student-t with $\nu = 2.5$.
  • Figure 3: Rows: Model dimensions. Marginal-Wasserstein metric (\ref{eq: wass}) against the number of coupling layers for different numbers of hidden units. Performance improves with an increase in either of these levers of capacity for all targets except Student-t with $\nu = 1.5$. The heavy tails of this target create problems when optimizing $\textrm{KL}\left(p\ \middle\Vert\ q\right)$ \cite{jaini2020tails}. See \ref{fig: objective wass} for comparisons with $\textrm{KL}\left(q\ \middle\Vert\ p\right)$ optimization. (In the first row, Funana uses 3 dimensions.)
  • Figure 4: Rows: Model dimensions. Marginal-Wasserstein metric (\ref{eq: wass}) against the number of layers (with $32$ hidden units) for different objectives. Performance of $\textrm{KL}\left(q\ \middle\Vert\ p\right)$ (red) optimization improves as the number of layers increases, often reaching that of exact inference. A gap remains for Funana in one hundred dimensions, indicating that this newly proposed density presents a significant challenge. (In the first row, Funana uses 3 dimensions.)
  • Figure 5: Rows: Number of iterations. Marginal-Wasserstein metric against the number of samples used for gradient evaluation for different targets in ten dimensions. STL (red) consistently outperforms total gradient (blue) at smaller batchsizes. However, as the batchsize increases, the difference vanishes. The performance for a given number of iterations (fixed for a row) improves significantly as the batchsize increases, indicating the impact of reduced gradient variance.
  • ...and 7 more figures
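
The captions above rely on two ingredients: targets that allow exact sampling and a marginal-Wasserstein metric with an exact-sample reference line. The sketch below illustrates how these could fit together; it is not the authors' implementation. It assumes the common funnel parameterization (first coordinate drawn from $N(0, 3^2)$, remaining coordinates from $N(0, e^{x_1})$), assumes the metric is the average over dimensions of the one-dimensional Wasserstein-1 distance between equal-sized sample sets, and assumes the reference line compares two independent sets of exact samples. The paper's own definition (its equation `eq: wass`) is not reproduced in this section and may differ. The names `sample_funnel`, `wasserstein_1d`, and `marginal_wasserstein` are hypothetical.

```python
import math
import random


def sample_funnel(n, dim, rng, scale=3.0):
    """Draw n exact samples from Neal's funnel in `dim` dimensions.

    Assumed parameterization: x[0] ~ N(0, scale^2) and, given x[0],
    x[i] ~ N(0, exp(x[0])) for i >= 1 (standard deviation exp(x[0] / 2)).
    """
    samples = []
    for _ in range(n):
        v = rng.gauss(0.0, scale)
        std = math.exp(v / 2.0)
        samples.append([v] + [rng.gauss(0.0, std) for _ in range(dim - 1)])
    return samples


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-sized 1D empirical samples.

    For equal sample sizes this is the mean absolute difference between
    the sorted values (matched order statistics).
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch requires equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def marginal_wasserstein(samples_a, samples_b):
    """Average the 1D Wasserstein-1 distance over all coordinates.

    Assumption: the metric averages per-dimension distances; the paper's
    exact definition may weight or aggregate them differently.
    """
    dim = len(samples_a[0])
    total = 0.0
    for d in range(dim):
        col_a = [s[d] for s in samples_a]
        col_b = [s[d] for s in samples_b]
        total += wasserstein_1d(col_a, col_b)
    return total / dim


# Reference value: the metric between two independent sets of exact samples.
rng = random.Random(0)
exact_a = sample_funnel(1000, 10, rng)
exact_b = sample_funnel(1000, 10, rng)
baseline = marginal_wasserstein(exact_a, exact_b)
```

Under these assumptions, `baseline` plays the role of the black dotted line in Figure 1: an approximation whose samples score close to it is, by this metric, about as accurate as exact sampling at that sample size. Working with one-dimensional marginals keeps the cost at a sort per dimension, which is what makes a metric of this kind scale to one hundred dimensions.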