Table of Contents
Fetching ...

Leveraging Sparsity to Improve No-U-Turn Sampling Efficiency for Hierarchical Bayesian Models

Cole C. Monnahan, Kasper Kristensen, James T. Thorson, Bob Carpenter

TL;DR

Sparse NUTS (SNUTS), which preconditions (decorrelates and descales) posteriors using a sparse precision matrix, is introduced and it is demonstrated that preconditioning with Q converges one to two orders of magnitude faster than Stan's industry standard diagonal or dense preconditioners.

Abstract

Analysts routinely use Bayesian hierarchical models to understand natural processes. The no-U-turn sampler (NUTS) is the most widely used algorithm to sample high-dimensional, continuously differentiable models. But NUTS is slowed by high correlations, especially in high dimensions, limiting the complexity of applied analyses. Here we introduce Sparse NUTS (SNUTS), which preconditions (decorrelates and descales) posteriors using a sparse precision matrix ($Q$). We use Template Model Builder (TMB) to efficiently compute $Q$ from the mode of the Laplace approximation to the marginal posterior, then pass the preconditioned posterior to NUTS through the Bayesian software Stan for sampling. We apply SNUTS to seventeen diverse case studies to demonstrate that preconditioning with $Q$ converges one to two orders of magnitude faster than Stan's industry standard diagonal or dense preconditioners. SNUTS also outperforms preconditioning with the inverse of the covariance estimated with Pathfinder variational inference. SNUTS does not improve sampling efficiency for models with the highly varying curvature found in funnels, wide tails, or multiple modes. SNUTS is most advantageous, and can be scaled beyond $10^4$ parameters, in the presence of high dimensionality, sparseness, and high correlations, all of which are widespread in applied statistics. An open-source implementation of SNUTS is provided in the R package SparseNUTS.

Leveraging Sparsity to Improve No-U-Turn Sampling Efficiency for Hierarchical Bayesian Models

TL;DR

Sparse NUTS (SNUTS), which preconditions (decorrelates and descales) posteriors using a sparse precision matrix, is introduced and it is demonstrated that preconditioning with Q converges one to two orders of magnitude faster than Stan's industry standard diagonal or dense preconditioners.

Abstract

Analysts routinely use Bayesian hierarchical models to understand natural processes. The no-U-turn sampler (NUTS) is the most widely used algorithm to sample high-dimensional, continuously differentiable models. But NUTS is slowed by high correlations, especially in high dimensions, limiting the complexity of applied analyses. Here we introduce Sparse NUTS (SNUTS), which preconditions (decorrelates and descales) posteriors using a sparse precision matrix (). We use Template Model Builder (TMB) to efficiently compute from the mode of the Laplace approximation to the marginal posterior, then pass the preconditioned posterior to NUTS through the Bayesian software Stan for sampling. We apply SNUTS to seventeen diverse case studies to demonstrate that preconditioning with converges one to two orders of magnitude faster than Stan's industry standard diagonal or dense preconditioners. SNUTS also outperforms preconditioning with the inverse of the covariance estimated with Pathfinder variational inference. SNUTS does not improve sampling efficiency for models with the highly varying curvature found in funnels, wide tails, or multiple modes. SNUTS is most advantageous, and can be scaled beyond parameters, in the presence of high dimensionality, sparseness, and high correlations, all of which are widespread in applied statistics. An open-source implementation of SNUTS is provided in the R package SparseNUTS.
Paper Structure (27 sections, 7 equations, 12 figures, 4 tables)

This paper contains 27 sections, 7 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Warmup time comparisons. Simple illustration of the impact of the ratio of marginal standard deviations (a) and correlations (b) on warmup (dashed lines) and total NUTS run time (solid lines) for a simple bivariate normal model $\mathbf{X} \sim \mathcal{N}(\mathbf{0},\Sigma)$. In (a) $\Sigma=(100\tau)$ and $\tau$ varies, while in (b) $\Sigma=(1\rho\rho1)$ and $\rho$ varies. NUTS was run three times (colors), first using Stan defaults and then after descaling only, or descaling and decorrelating the posterior prior to sampling. For all cases there were an initial 1000 warmup iterations followed by 1000 sampling iterations and using Stan defaults. Diagonal mass matrix adaptation was engaged during the warmup period.
  • Figure 2: Visualizations of Q. The resulting sparsity pattern detected automatically by TMB for the precision matrix $Q$ (top left), correlation matrix derived from $\Sigma=Q^{-1}$ (top right), the lower triangle Cholesky decomposition of $Q$ itself (bottom left), and the Cholesky decomposition of $Q$ after permuting by the matrix $P$ to maximize sparsity (i.e., $PQP^\top$, bottom right) for the sdmTMB spatial model. Non-zeroes are colored grey while zeroes are white (to highlight sparsity) except the correlation matrix which is colored by the correlation values. $Q$ has 81% off-diagonal zeroes, while the correlation matrix is fully dense, demonstrating the difference in sparsity between $Q$ and $Q^{-1}$. In this case the sparsity pattern in $Q$ leads to a nearly dense Cholesky (bottom left), but sparsity is recovered by first permuting $Q$ (bottom right)
  • Figure 3: Benchmark results. Benchmark results for a simulated Poisson SPDE model of increasing number of observations and spatial random effects with high sparsity of $Q$ by design, typically > 99%. The plots show: left) the computational cost of just the transformation of $g_q \rightarrow g_{q'}$ alone (i.e., with a negligible cost of evaluating $g_q$); center) the transformation and the cost of the gradient $g_q(q)$; and right) the memory used for the transformation. Three types of transformations (colors) are shown relative to the value from the original model.
  • Figure 4: Comparison to Pathfinder. The plots compare the approximation performance of TMB's approximation against Pathfinder for the case studies given in Table \ref{['tab:mods']}. The TMB method assumes the posterior is approximated by $\mathcal{N}(\hat{x}, Q^{-1})$ where $\hat{x}$ is the conditional mode and $Q$ the sparse precision matrix calculated in equation \ref{['eq:estQ']}. Each of the 20 points represents a single analysis with a different random initial parameter vector (a posterior sample) and random seed for sample generation. Means are given by filled triangles and is always 1 for TMB's method because performance is measured relative to it. Pathfinder used 4 serial paths, while $Q$ optimized once and generated samples. The wall time is shown for the two methods (top; lower is better), and the 1D Wasserstein distance (bottom; lower is better) is a measure of how well the 1000 samples approximate the posterior, with both shown relative to TMB. Models are ordered by Pathfinder performance relative to TMB.
  • Figure 5: Evaluation of scalability. Performance results for increasing dimensionality for two models (columns): glmmTMB is a negative binomial regression and spde is a spatial Poisson SPDE regression. Different performance metrics (rows) are shown for each algorithm (colors). Three analyses of four chains each (lines) show the variability in estimates.
  • ...and 7 more figures