Scalable Bayesian Inference for Generalized Linear Mixed Models via Stochastic Gradient MCMC

Samuel I. Berchuck, Youngsoo Baek, Felipe A. Medeiros, Andrea Agazzi

TL;DR

This work tackles scalable Bayesian inference for GLMMs where the marginal likelihood is intractable and standard MCMC is impractical at scale. It introduces a stochastic gradient MCMC framework that leverages Fisher’s identity to form unbiased Monte Carlo gradients and couples this with a post-hoc variance correction to calibrate posterior uncertainty under minibatching and gradient approximation. Across simulations and a real-world electronic health records analysis, the proposed method yields accurate posterior means and variances and demonstrates improved uncertainty quantification over existing approaches, particularly in large-$n$ regimes. The resulting approach enables reliable, scalable Bayesian analysis for dependent data with complex random effects, with practical implications for inference and decision-making in biomedical and social science applications.

Abstract

The generalized linear mixed model (GLMM) is widely used for analyzing correlated data, particularly in large-scale biomedical and social science applications. Scalable Bayesian inference for GLMMs is challenging because the marginal likelihood is intractable and conventional Markov chain Monte Carlo (MCMC) methods become computationally prohibitive as the number of subjects grows. We develop a stochastic gradient MCMC (SGMCMC) algorithm tailored to GLMMs that enables accurate posterior inference in the large-sample regime. Our approach uses Fisher's identity to construct an unbiased Monte Carlo estimator of the gradient of the marginal log-likelihood, making SGMCMC feasible when direct gradient computation is impossible. We analyze the additional variability introduced by both minibatching and gradient approximation, and derive a post-hoc covariance correction that yields properly calibrated posterior uncertainty. Through simulations, we show that the proposed method provides accurate posterior means and variances, outperforming existing approaches, including control variate methods, in large-$n$ settings. We further demonstrate the method's practical utility in an analysis of electronic health records data, where accounting for variance inflation materially changes scientific conclusions.
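Fisher's identity, which the abstract relies on, states that the gradient of a marginal log-likelihood equals the posterior expectation of the complete-data score: $\nabla_{\boldsymbol{\Omega}} \log p(\mathbf{y}_i \mid \boldsymbol{\Omega}) = \mathbb{E}_{\boldsymbol{\gamma}_i \mid \mathbf{y}_i, \boldsymbol{\Omega}}\left[\nabla_{\boldsymbol{\Omega}} \log p(\mathbf{y}_i, \boldsymbol{\gamma}_i \mid \boldsymbol{\Omega})\right]$. The sketch below checks this numerically on a toy Gaussian random-intercept model, chosen only because both the marginal gradient and the random-effect posterior are available in closed form there. The model, the function names, and the parameter values are illustrative assumptions and are not the paper's estimator; in a non-Gaussian GLMM the random-effect draws would have to come from some sampler rather than an exact posterior.

```python
import numpy as np

def marginal_grad_beta(y, beta, sigma2, tau2):
    """Exact d/d(beta) of log p(y | beta) for y_j = beta + gamma + eps_j,
    gamma ~ N(0, tau2), eps_j ~ N(0, sigma2) (toy model, assumed here)."""
    m = y.size
    return np.sum(y - beta) / (sigma2 + m * tau2)

def fisher_grad_beta(y, beta, sigma2, tau2, n_draws, rng):
    """Monte Carlo estimate of the same gradient via Fisher's identity:
    average the complete-data score over draws of gamma | y, beta."""
    m = y.size
    resid_sum = np.sum(y - beta)
    # Closed-form Gaussian posterior of the random intercept.
    post_mean = tau2 * resid_sum / (sigma2 + m * tau2)
    post_var = sigma2 * tau2 / (sigma2 + m * tau2)
    gamma = rng.normal(post_mean, np.sqrt(post_var), size=n_draws)
    # Complete-data score w.r.t. beta: sum_j (y_j - beta - gamma) / sigma2.
    scores = (resid_sum - m * gamma) / sigma2
    return scores.mean()

rng = np.random.default_rng(0)
beta_true, sigma2, tau2, m = 1.0, 1.0, 0.5, 5

# Simulate one subject's observations with a shared random intercept.
y = beta_true + rng.normal(0.0, np.sqrt(tau2)) + rng.normal(0.0, np.sqrt(sigma2), size=m)

beta = 0.3  # arbitrary evaluation point
exact = marginal_grad_beta(y, beta, sigma2, tau2)
mc = fisher_grad_beta(y, beta, sigma2, tau2, n_draws=200_000, rng=rng)
print(f"exact gradient: {exact:.4f}, Fisher-identity estimate: {mc:.4f}")
```

The two values should agree up to Monte Carlo error, which shrinks as `n_draws` grows. With few draws the estimate stays unbiased but noisy, which is the extra gradient variability the abstract says must be accounted for.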

Paper Structure (17 sections, 5 theorems, 11 equations, 4 figures, 1 algorithm)

This paper contains 17 sections, 5 theorems, 11 equations, 4 figures, 1 algorithm.

Key Result

Lemma 3.1

For all $i \in [n]$, $\hat{g}_i(\boldsymbol{\Omega})$ as defined in (e:hatgi) and $\hat{\boldsymbol{\Psi}}_i(\boldsymbol{\Omega})$ as defined in (e:psii) are unbiased estimators of the gradient of the marginal log-likelihood, $g_i(\boldsymbol{\Omega})$, and the covariance matrix $\boldsymbol{\Psi}_i$ ...

Figures (4)

  • Figure 1: Posterior estimation of the log variance for the uncorrected (black) and corrected (grey) SGLD algorithms. Each value represents the mean and 95% quantile intervals based on 100 simulated data sets. The columns represent the minibatch size ($S$) and the rows represent the sample size ($n$) and parameter. The black dashed lines indicate the true log posterior variance. Estimates are given across an appropriate range of $\delta$.
  • Figure 2: Assessing the algorithm's ability to estimate the variance of the posterior predictive distribution (PPD). Presented are the log ratio of the estimated PPD variance and the true PPD variance for both the uncorrected (black) and corrected (grey) algorithms. Columns and rows indicate batch size ($S$) and sample size ($n$), respectively. Black dashed lines indicate correct PPD variance estimation. Estimates are given across a range of $\delta$.
  • Figure 3: Log of posterior variance estimates presented across runtime (hours) for the Bernoulli GLMM. Columns and rows indicate parameter and sample size ($n$), respectively. Algorithms include conditional (c) Gibbs sampling and the SGLD algorithm with various batch sizes ($S$). Three variants of SGLD are presented: vanilla SGLD (solid); SGLD initialized using the control variates algorithm of Baker et al. (2019), adapted to GLMMs (dashed); and SGLD samples corrected according to our proposal (long dashed). At each point in time, the log posterior variance was calculated using the most recent 75% of the samples up to that point. Estimates are averaged across 100 simulated data sets.
  • Figure 4: Posterior odds ratios and 95% credible intervals for $\boldsymbol{\beta}$. Summaries are presented for the uncorrected and corrected SGLD algorithms. The parameters are presented in decreasing order and color-coded based on whether the credible interval contained zero.
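
The uncorrected-versus-corrected comparisons in these captions rest on the fact that minibatch gradient noise inflates the spread of SGLD samples relative to the true posterior. The sketch below illustrates only that inflation, on a toy model assumed here for convenience (an i.i.d. Gaussian mean with unit noise variance and a flat prior, not a GLMM). It does not implement the paper's covariance correction. The step size `eps`, the batch size `S`, and the closed-form `predicted_var` for this linear recursion are all specific to the toy setup.

```python
import numpy as np

rng = np.random.default_rng(1)
n, S, eps = 1000, 10, 1e-4          # sample size, minibatch size, step size (assumed values)
x = rng.normal(2.0, 1.0, size=n)    # toy data: x_i ~ N(theta, 1)
post_var = 1.0 / n                  # exact posterior variance of theta under a flat prior

n_iter, burn = 50_000, 5_000
theta = x.mean()                    # start at the posterior mean
draws = np.empty(n_iter)
for t in range(n_iter):
    # Minibatch drawn with replacement; rescale to estimate the full-data gradient.
    batch = x[rng.integers(0, n, size=S)]
    grad = (n / S) * np.sum(batch - theta)
    # Plain SGLD update: half-step along the stochastic gradient plus injected noise.
    theta = theta + 0.5 * eps * grad + np.sqrt(eps) * rng.normal()
    draws[t] = theta

sgld_var = draws[burn:].var()
# Stationary variance of this particular linear recursion, derived for the toy model only.
predicted_var = (1 + eps * n**2 * x.var() / (4 * S)) / (n * (1 - eps * n / 4))
inflation = sgld_var / post_var
print(f"SGLD variance / true posterior variance: {inflation:.2f}")
print(f"SGLD variance / toy closed form:         {sgld_var / predicted_var:.2f}")
```

In this toy setting the sample variance of the chain is several times the true posterior variance, and it moves closer to the truth as `S` grows or `eps` shrinks, which is the qualitative pattern the figures examine across $S$, $n$, and $\delta$.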

Theorems & Definitions (9)

  • Lemma 3.1
  • proof
  • Lemma 3.2
  • proof
  • Remark 3.3
  • Lemma 3.4
  • Proposition 3.5
  • Theorem 3.6
  • proof