Leveraging Sparsity to Improve No-U-Turn Sampling Efficiency for Hierarchical Bayesian Models

Cole C. Monnahan; Kasper Kristensen; James T. Thorson; Bob Carpenter

Leveraging Sparsity to Improve No-U-Turn Sampling Efficiency for Hierarchical Bayesian Models

Cole C. Monnahan, Kasper Kristensen, James T. Thorson, Bob Carpenter

TL;DR

Sparse NUTS (SNUTS), which preconditions (decorrelates and descales) posteriors using a sparse precision matrix, is introduced and it is demonstrated that preconditioning with Q converges one to two orders of magnitude faster than Stan's industry standard diagonal or dense preconditioners.

Abstract

Analysts routinely use Bayesian hierarchical models to understand natural processes. The no-U-turn sampler (NUTS) is the most widely used algorithm to sample high-dimensional, continuously differentiable models. But NUTS is slowed by high correlations, especially in high dimensions, limiting the complexity of applied analyses. Here we introduce Sparse NUTS (SNUTS), which preconditions (decorrelates and descales) posteriors using a sparse precision matrix ($Q$). We use Template Model Builder (TMB) to efficiently compute $Q$ from the mode of the Laplace approximation to the marginal posterior, then pass the preconditioned posterior to NUTS through the Bayesian software Stan for sampling. We apply SNUTS to seventeen diverse case studies to demonstrate that preconditioning with $Q$ converges one to two orders of magnitude faster than Stan's industry standard diagonal or dense preconditioners. SNUTS also outperforms preconditioning with the inverse of the covariance estimated with Pathfinder variational inference. SNUTS does not improve sampling efficiency for models with the highly varying curvature found in funnels, wide tails, or multiple modes. SNUTS is most advantageous, and can be scaled beyond $10^4$ parameters, in the presence of high dimensionality, sparseness, and high correlations, all of which are widespread in applied statistics. An open-source implementation of SNUTS is provided in the R package SparseNUTS.

Leveraging Sparsity to Improve No-U-Turn Sampling Efficiency for Hierarchical Bayesian Models

TL;DR

Abstract

). We use Template Model Builder (TMB) to efficiently compute

from the mode of the Laplace approximation to the marginal posterior, then pass the preconditioned posterior to NUTS through the Bayesian software Stan for sampling. We apply SNUTS to seventeen diverse case studies to demonstrate that preconditioning with

converges one to two orders of magnitude faster than Stan's industry standard diagonal or dense preconditioners. SNUTS also outperforms preconditioning with the inverse of the covariance estimated with Pathfinder variational inference. SNUTS does not improve sampling efficiency for models with the highly varying curvature found in funnels, wide tails, or multiple modes. SNUTS is most advantageous, and can be scaled beyond

parameters, in the presence of high dimensionality, sparseness, and high correlations, all of which are widespread in applied statistics. An open-source implementation of SNUTS is provided in the R package SparseNUTS.

Paper Structure (27 sections, 7 equations, 12 figures, 4 tables)

This paper contains 27 sections, 7 equations, 12 figures, 4 tables.

Introduction
Methods
Sparse preconditioning in NUTS
An overview of HMC
Preconditioning with a dense matrix
Efficient preconditioning with a sparse matrix
Template Model Builder overview
Calculating a sparse approximate precision matrix
The sparse NUTS algorithm
Evaluations
Case studies
Quantifying the accuracy of $Q$
Comparing sampling efficiency between algorithms
Performance in higher dimensions
Benchmarking sparse NUTS performance
...and 12 more sections

Figures (12)

Figure 1: Warmup time comparisons. Simple illustration of the impact of the ratio of marginal standard deviations (a) and correlations (b) on warmup (dashed lines) and total NUTS run time (solid lines) for a simple bivariate normal model $\mathbf{X} \sim \mathcal{N}(\mathbf{0},\Sigma)$. In (a) $\Sigma=(100\tau)$ and $\tau$ varies, while in (b) $\Sigma=(1\rho\rho1)$ and $\rho$ varies. NUTS was run three times (colors), first using Stan defaults and then after descaling only, or descaling and decorrelating the posterior prior to sampling. For all cases there were an initial 1000 warmup iterations followed by 1000 sampling iterations and using Stan defaults. Diagonal mass matrix adaptation was engaged during the warmup period.
Figure 2: Visualizations of Q. The resulting sparsity pattern detected automatically by TMB for the precision matrix $Q$ (top left), correlation matrix derived from $\Sigma=Q^{-1}$ (top right), the lower triangle Cholesky decomposition of $Q$ itself (bottom left), and the Cholesky decomposition of $Q$ after permuting by the matrix $P$ to maximize sparsity (i.e., $PQP^\top$, bottom right) for the sdmTMB spatial model. Non-zeroes are colored grey while zeroes are white (to highlight sparsity) except the correlation matrix which is colored by the correlation values. $Q$ has 81% off-diagonal zeroes, while the correlation matrix is fully dense, demonstrating the difference in sparsity between $Q$ and $Q^{-1}$. In this case the sparsity pattern in $Q$ leads to a nearly dense Cholesky (bottom left), but sparsity is recovered by first permuting $Q$ (bottom right)
Figure 3: Benchmark results. Benchmark results for a simulated Poisson SPDE model of increasing number of observations and spatial random effects with high sparsity of $Q$ by design, typically > 99%. The plots show: left) the computational cost of just the transformation of $g_q \rightarrow g_{q'}$ alone (i.e., with a negligible cost of evaluating $g_q$); center) the transformation and the cost of the gradient $g_q(q)$; and right) the memory used for the transformation. Three types of transformations (colors) are shown relative to the value from the original model.
Figure 4: Comparison to Pathfinder. The plots compare the approximation performance of TMB's approximation against Pathfinder for the case studies given in Table \ref{['tab:mods']}. The TMB method assumes the posterior is approximated by $\mathcal{N}(\hat{x}, Q^{-1})$ where $\hat{x}$ is the conditional mode and $Q$ the sparse precision matrix calculated in equation \ref{['eq:estQ']}. Each of the 20 points represents a single analysis with a different random initial parameter vector (a posterior sample) and random seed for sample generation. Means are given by filled triangles and is always 1 for TMB's method because performance is measured relative to it. Pathfinder used 4 serial paths, while $Q$ optimized once and generated samples. The wall time is shown for the two methods (top; lower is better), and the 1D Wasserstein distance (bottom; lower is better) is a measure of how well the 1000 samples approximate the posterior, with both shown relative to TMB. Models are ordered by Pathfinder performance relative to TMB.
Figure 5: Evaluation of scalability. Performance results for increasing dimensionality for two models (columns): glmmTMB is a negative binomial regression and spde is a spatial Poisson SPDE regression. Different performance metrics (rows) are shown for each algorithm (colors). Three analyses of four chains each (lines) show the variability in estimates.
...and 7 more figures

Leveraging Sparsity to Improve No-U-Turn Sampling Efficiency for Hierarchical Bayesian Models

TL;DR

Abstract

Leveraging Sparsity to Improve No-U-Turn Sampling Efficiency for Hierarchical Bayesian Models

Authors

TL;DR

Abstract

Table of Contents

Figures (12)