Scalable Bayesian Inference for Generalized Linear Mixed Models via Stochastic Gradient MCMC
Samuel I. Berchuck, Youngsoo Baek, Felipe A. Medeiros, Andrea Agazzi
TL;DR
This work tackles scalable Bayesian inference for GLMMs where the marginal likelihood is intractable and standard MCMC is impractical at scale. It introduces a stochastic gradient MCMC framework that leverages Fisher’s identity to form unbiased Monte Carlo gradients and couples this with a post-hoc variance correction to calibrate posterior uncertainty under minibatching and gradient approximation. Across simulations and a real-world electronic health records analysis, the proposed method yields accurate posterior means and variances and demonstrates improved uncertainty quantification over existing approaches, particularly in large-$n$ regimes. The resulting approach enables reliable, scalable Bayesian analysis for dependent data with complex random effects, with practical implications for inference and decision-making in biomedical and social science applications.
Abstract
The generalized linear mixed model (GLMM) is widely used for analyzing correlated data, particularly in large-scale biomedical and social science applications. Scalable Bayesian inference for GLMMs is challenging because the marginal likelihood is intractable and conventional Markov chain Monte Carlo (MCMC) methods become computationally prohibitive as the number of subjects grows. We develop a stochastic gradient MCMC (SGMCMC) algorithm tailored to GLMMs that enables accurate posterior inference in the large-sample regime. Our approach uses Fisher's identity to construct an unbiased Monte Carlo estimator of the gradient of the marginal log-likelihood, making SGMCMC feasible when direct gradient computation is impossible. We analyze the additional variability introduced by both minibatching and gradient approximation, and derive a post-hoc covariance correction that yields properly calibrated posterior uncertainty. Through simulations, we show that the proposed method provides accurate posterior means and variances, outperforming existing approaches, including control variate methods, in large-$n$ settings. We further demonstrate the method's practical utility in an analysis of electronic health records data, where accounting for variance inflation materially changes scientific conclusions.
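For orientation, Fisher's identity states that $\nabla_\theta \log p(y \mid \theta) = \mathbb{E}_{b \sim p(b \mid y, \theta)}\left[\nabla_\theta \log p(y, b \mid \theta)\right]$, so averaging complete-data gradients over draws of the random effects $b$ from their conditional posterior gives an unbiased estimate of the marginal gradient. The sketch below is not the authors' implementation. It checks the identity numerically on a toy Gaussian random-intercept model, chosen only because both the marginal gradient and the posterior of $b$ are available in closed form there. The model, the function names, and all parameter values are assumptions made for illustration. In a non-Gaussian GLMM the posterior of $b$ is not available in closed form, and the draws would have to come from some other sampler.

```python
import numpy as np

# Toy model (an assumption for illustration, not the paper's setup):
#   y_j = mu + b + eps_j,  b ~ N(0, tau2),  eps_j ~ N(0, sigma2),  j = 1..n
# for a single subject with a scalar random intercept b.

def exact_marginal_grad_mu(y, mu, sigma2, tau2):
    """Closed-form d/dmu of log p(y | mu), with b integrated out analytically."""
    n = y.size
    return np.sum(y - mu) / (sigma2 + n * tau2)

def fisher_mc_grad_mu(y, mu, sigma2, tau2, num_draws, rng):
    """Monte Carlo estimate of the same gradient via Fisher's identity."""
    n = y.size
    # Conditional posterior b | y, mu is Gaussian in this toy model.
    post_var = 1.0 / (n / sigma2 + 1.0 / tau2)
    post_mean = post_var * np.sum(y - mu) / sigma2
    b = rng.normal(post_mean, np.sqrt(post_var), size=num_draws)
    # Complete-data gradient d/dmu log p(y, b | mu), evaluated at each draw of b.
    complete_grads = (np.sum(y - mu) - n * b) / sigma2
    return complete_grads.mean()

rng = np.random.default_rng(0)
mu_true, sigma2, tau2 = 1.0, 1.0, 1.0
b_true = rng.normal(0.0, np.sqrt(tau2))
y = mu_true + b_true + rng.normal(0.0, np.sqrt(sigma2), size=5)

mu = 0.3  # point at which the gradient is evaluated
exact = exact_marginal_grad_mu(y, mu, sigma2, tau2)
approx = fisher_mc_grad_mu(y, mu, sigma2, tau2, num_draws=200_000, rng=rng)
print(f"exact gradient: {exact:.4f}, Monte Carlo estimate: {approx:.4f}")
```

With a finite number of draws the estimate is noisy around the exact value. This Monte Carlo noise, together with minibatching over subjects, is the extra variability that the abstract's post-hoc covariance correction is meant to account for.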
