Reducing normalizing flow complexity for MCMC preconditioning

David Nabergoj, Erik Štrumbelj

TL;DR

A factorized preconditioning architecture is proposed that reduces NF complexity by combining a linear component with a conditional NF, improving adaptability to target geometry and achieving higher effective sample sizes on hierarchical Bayesian model posteriors with weak likelihoods and strong funnel geometries.

Abstract

Preconditioning is a key component of MCMC algorithms that improves sampling efficiency by facilitating exploration of geometrically complex target distributions through an invertible map. While linear preconditioners are often sufficient for moderately complex target distributions, recent work has explored nonlinear preconditioning with invertible neural networks as components of normalizing flows (NFs). However, empirical and theoretical studies show that overparameterized NF preconditioners can degrade sampling efficiency and fit quality. Moreover, existing NF-based approaches do not adapt their architectures to the target distribution. Related work outside of MCMC similarly finds that suitably parameterized NFs can achieve comparable or superior performance with substantially less training time or data. We propose a factorized preconditioning architecture that reduces NF complexity by combining a linear component with a conditional NF, improving adaptability to target geometry. The linear preconditioner is applied to dimensions that are approximately Gaussian, as estimated from warmup samples, while the conditional NF models more complex dimensions. Our method yields significantly better tail samples on two complex synthetic distributions and consistently better performance on a sparse logistic regression posterior across varying likelihood and prior strengths. It also achieves higher effective sample sizes on hierarchical Bayesian model posteriors with weak likelihoods and strong funnel geometries. This approach is particularly relevant for hierarchical Bayesian model analyses with limited data and could inform current theoretical and software strides in neural MCMC design.
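The split between approximately Gaussian dimensions and more complex ones can be illustrated with a minimal sketch. The abstract does not state which Gaussianity criterion the authors use, so the moment-based check below (sample skewness and excess kurtosis against the thresholds `skew_tol` and `kurt_tol`) is an assumption made purely for illustration, as are the function names. Independent draws from Neal's funnel stand in for MCMC warmup samples, and the conditional NF that would model the remaining dimensions is omitted.

```python
import numpy as np

def approx_gaussian_mask(samples, skew_tol=0.5, kurt_tol=1.0):
    """Flag dimensions whose marginals look roughly Gaussian.

    Hypothetical criterion (not taken from the paper): small absolute
    sample skewness and small absolute excess kurtosis.
    """
    z = (samples - samples.mean(axis=0)) / samples.std(axis=0)
    skew = (z ** 3).mean(axis=0)
    excess_kurt = (z ** 4).mean(axis=0) - 3.0
    return (np.abs(skew) < skew_tol) & (np.abs(excess_kurt) < kurt_tol)

def fit_diagonal_linear(samples):
    """Diagonal linear preconditioner: per-dimension shift and scale."""
    return samples.mean(axis=0), samples.std(axis=0)

# Stand-in for warmup samples: i.i.d. draws from a 2D Neal's funnel,
# x0 ~ N(0, 3^2) and x1 | x0 ~ N(0, exp(x0)).
rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 3.0, size=5000)
x1 = rng.normal(size=5000) * np.exp(x0 / 2.0)
warmup = np.column_stack([x0, x1])

mask = approx_gaussian_mask(warmup)            # True where linear suffices
shift, scale = fit_diagonal_linear(warmup[:, mask])
whitened = (warmup[:, mask] - shift) / scale   # linearly preconditioned block
complex_dims = warmup[:, ~mask]                # would be passed to the conditional NF
```

In this toy case the marginal of `x0` is Gaussian, while the marginal of `x1` is a heavy-tailed scale mixture, so only `x0` is assigned to the linear component. Real warmup draws are autocorrelated, so any such test would need thresholds chosen with that in mind.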

Paper Structure

This paper contains 22 sections, 16 equations, 6 figures, 1 algorithm.

Figures (6)

  • Figure 1: HMC samples from Neal's funnel distribution with different preconditioners: a diagonal linear map, RNVP, and F-RNVP. The scatterplots show samples for random variables $x_0$ and $x_1$. Contour lines show the corresponding true negative log probability density value. The histograms represent the right tails of empirical densities for $x_0$. Note that the contours for 5, 10, and 100 are truncated on the left due to numerics, but actually extend towards $-\infty$.
  • Figure 2: HMC samples from the banana distribution with different preconditioners: a diagonal linear map, RNVP, and F-RNVP. The scatterplots show samples for random variables $x_0$ and $x_1$. Contour lines show the corresponding true negative log probability density value. The histograms represent the right tails of empirical densities for $x_0$.
  • Figure 3: Distributions of average KSD for different preconditioners and dataset sizes $n$ on the sparse German credit dataset. Boxplots are based on 30 experiments. Colored boxes denote the inter-quartile range. The average KSD in each experiment is computed with 250 random HMC samples over 150 trials. This measure incorporates many MCMC samples, avoiding the prohibitive $O(n^2)$ time and space complexity of computing KSD. The numbers below dataset size ticks represent the median number of identified approximately Gaussian dimensions according to the F-RNVP preconditioner. KSD values are not directly comparable across different dataset sizes, as each size affects the likelihood and thus the posterior. This visualization compares preconditioners across different $n$ while preserving absolute values useful for future comparisons.
  • Figure 4: ESS and scatterplots of NUTS draws for different combinations of posteriors in columns and preconditioners in rows. All posteriors use a single data point for the likelihood. Scatterplots show draws for the unconstrained standard deviation and a parameter that uses it in its Gaussian prior. The shown ESS values represent minima across all dimensions and are based on 1000 sampling iterations.
  • Figure 5: Tail ESS and bulk ESS for different combinations of target distributions and preconditioners. All likelihoods utilize a single dataset row. Boxplots are based on 30 repeated runs with different random seeds. SGC and R-VI columns use log scales. The shown ESS values represent minima across all dimensions.
  • ...and 1 more figure
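The averaged KSD described in the Figure 3 caption (250 random samples per trial, 150 trials) can be sketched as follows. The caption does not specify the kernel, the estimator, or whether subsampling is done with replacement, so the inverse multiquadric kernel, the V-statistic estimator, sampling without replacement, and all function and parameter names here are assumptions for illustration only. A standard Gaussian target with known score $-x$ stands in for the posterior.

```python
import numpy as np

def ksd_imq(x, score, c=1.0, beta=-0.5):
    """V-statistic KSD estimate with kernel k(x, y) = (c^2 + ||x - y||^2)^beta.

    x and score both have shape (n, d); score holds the gradient of the
    target log density evaluated at each row of x.
    """
    n, d = x.shape
    r = x[:, None, :] - x[None, :, :]                  # pairwise differences, (n, n, d)
    sq = (r ** 2).sum(-1)                              # squared distances, (n, n)
    q = c ** 2 + sq
    k = q ** beta
    grad_x = 2.0 * beta * (q ** (beta - 1.0))[..., None] * r   # gradient of k in x
    s_i, s_j = score[:, None, :], score[None, :, :]
    term1 = (s_i * s_j).sum(-1) * k
    term2 = -(s_i * grad_x).sum(-1)                    # s(x)^T grad_y k, with grad_y k = -grad_x k
    term3 = (s_j * grad_x).sum(-1)                     # s(y)^T grad_x k
    term4 = (-2.0 * beta * d * q ** (beta - 1.0)
             - 4.0 * beta * (beta - 1.0) * q ** (beta - 2.0) * sq)  # trace of mixed derivative
    return float(np.sqrt(max((term1 + term2 + term3 + term4).mean(), 0.0)))

def average_ksd(samples, score_fn, subsample_size=250, n_trials=150, seed=0):
    """Mean KSD over random subsamples; each trial costs O(subsample_size^2)."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_trials):
        idx = rng.choice(len(samples), size=subsample_size, replace=False)
        sub = samples[idx]
        values.append(ksd_imq(sub, score_fn(sub)))
    return float(np.mean(values))

# Toy check on a standard Gaussian target, whose score is -x.
rng = np.random.default_rng(1)
good = rng.normal(size=(2000, 2))
shifted = good + 2.0                                   # deliberately biased draws
ksd_good = average_ksd(good, lambda x: -x, n_trials=20)
ksd_shifted = average_ksd(shifted, lambda x: -x, n_trials=20)
```

Subsampling keeps each pairwise kernel matrix at 250 by 250 while still drawing on the full set of MCMC samples across trials; in this toy check the biased draws should receive the larger average KSD.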