metabeta -- A fast neural model for Bayesian mixed-effects regression

Alex Kipnis, Marcel Binz, Eric Schulz

TL;DR

Hierarchical data analysis via Bayesian mixed-effects regression is often hindered by the computational burden of MCMC. metabeta addresses this by training a transformer-based neural posterior estimator on simulated data with varying priors. It uses a two-network architecture (global and local summaries) and a normalizing-flow posterior to approximate $p(\vartheta \mid D)$ efficiently; post-hoc importance sampling and conformal calibration then refine and correct the credible intervals. Empirical results on toy, in-distribution, and out-of-distribution data show that metabeta achieves accuracy comparable to, and sometimes exceeding, Hamiltonian Monte Carlo, while offering inference that is orders of magnitude faster and robust uncertainty quantification. The approach enables rapid prototyping and deployment of Bayesian mixed-effects models with prior information. Open-source tooling is planned to facilitate broad adoption, extension to larger problems, and predictor-attention enhancements. Overall, metabeta broadens the practical applicability of Bayesian mixed-effects regression by combining amortized inference with principled uncertainty calibration.
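
The post-hoc importance sampling step mentioned above can be illustrated with a generic self-normalized importance sampler. This is a minimal sketch, not the authors' implementation: it assumes that draws from an approximate posterior (the proposal) are available together with their log densities, and that the unnormalized log posterior (log-likelihood plus log-prior) can be evaluated. The function names `importance_weights` and `effective_sample_size` and the one-dimensional Gaussian toy problem are hypothetical.

```python
import numpy as np

def importance_weights(log_target, log_proposal):
    """Self-normalized importance weights from (unnormalized) log densities."""
    log_w = np.asarray(log_target) - np.asarray(log_proposal)
    log_w = log_w - log_w.max()   # subtract the max for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

def effective_sample_size(weights):
    """Kish effective sample size; equals k when all weights are equal."""
    return 1.0 / np.sum(weights ** 2)

# Toy setting (assumption): proposal N(0, 2^2) stands in for the neural
# posterior, target N(1, 1) stands in for the true posterior.
rng = np.random.default_rng(0)
k = 20_000
samples = rng.normal(0.0, 2.0, size=k)
log_q = -0.5 * (samples / 2.0) ** 2 - np.log(2.0)  # proposal, up to a constant
log_p = -0.5 * (samples - 1.0) ** 2                # target, up to a constant

w = importance_weights(log_p, log_q)
post_mean = np.sum(w * samples)   # reweighted estimate of the target mean
ess = effective_sample_size(w)
print(f"posterior mean ~ {post_mean:.3f}, ESS = {ess:.0f} of {k}")
```

A low effective sample size relative to the number of draws signals that the proposal is far from the target, in which case the reweighted estimates are dominated by a few samples.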

Abstract

Hierarchical data with multiple observations per group is ubiquitous in empirical sciences and is often analyzed using mixed-effects regression. In such models, Bayesian inference gives an estimate of uncertainty but is analytically intractable and requires costly approximation using Markov Chain Monte Carlo (MCMC) methods. Neural posterior estimation shifts the bulk of computation from inference time to pre-training time, amortizing over simulated datasets with known ground truth targets. We propose metabeta, a transformer-based neural network model for Bayesian mixed-effects regression. Using simulated and real data, we show that it reaches stable and comparable performance to MCMC-based parameter estimation at a fraction of the usually required time.
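
To make "simulated datasets with known ground truth targets" concrete, the following sketch draws one grouped dataset from a linear mixed-effects model. It is an illustration under stated assumptions rather than the paper's simulator: the normal and half-normal priors, the choice to let random effects act on the first `q` predictors, and the function name `simulate_dataset` are all hypothetical. The symbols `d`, `q`, `m` and `n` are read here as the number of fixed effects, the number of random effects, the number of groups and the per-group sample sizes.

```python
import numpy as np

def simulate_dataset(rng, d=3, q=1, m=10, n_max=20):
    """Draw one dataset from a linear mixed-effects model (requires q <= d)."""
    # Parameters are sampled from priors first (illustrative hyperparameters).
    beta = rng.normal(0.0, 1.0, size=d)                  # fixed effects
    sigma_alpha = np.abs(rng.normal(0.0, 1.0, size=q))   # random-effect SDs
    sigma_eps = np.abs(rng.normal(0.0, 1.0))             # noise SD
    # Random effects depend on the sampled SDs: one row per group.
    alpha = rng.normal(0.0, sigma_alpha, size=(m, q))

    n = rng.integers(2, n_max + 1, size=m)               # observations per group
    groups = np.repeat(np.arange(m), n)                  # group index per row
    total = groups.size
    X = np.column_stack([np.ones(total), rng.normal(size=(total, d - 1))])
    Z = X[:, :q]                                         # random-effect design

    noise = rng.normal(0.0, sigma_eps, size=total)
    y = X @ beta + np.sum(Z * alpha[groups], axis=1) + noise
    return {"X": X, "y": y, "groups": groups, "beta": beta,
            "alpha": alpha, "sigma_alpha": sigma_alpha, "sigma_eps": sigma_eps}

data = simulate_dataset(np.random.default_rng(1))
print(data["X"].shape, data["y"].shape, data["beta"])
```

Because the parameters are returned alongside the data, each simulated dataset comes with the ground-truth targets that a neural posterior estimator would be trained to recover.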

Paper Structure

This paper contains 27 sections, 18 equations, 8 figures, 2 tables, 2 algorithms.

Figures (8)

  • Figure 1: (A) Dataset Simulation. Given a set of priors, we sample regression parameters and noise in a cascading way. Predictors are sampled from various distributions for training and from real datasets for testing, and outcomes are generated according to Equation 1. (B) Model Pipeline. Observed data are summarized locally (per group) and globally (across groups). During training, the posterior networks learn the forward mapping from the true regression parameters to a simple multivariate base distribution, conditioned on the respective summaries and priors. During inference, we draw k samples from the base distribution and apply the implicitly learned backward mapping to them, approximating sampling from the unknown target posterior. (C) Example Posteriors. Kernel density estimates from the posterior samples of metabeta (MB) and Hamiltonian Monte Carlo (HMC) on a toy dataset. (D) Compute Time. For test sets with $d=5$, $q=1$, $m \le 30$ and $n_i \le 70$, our model takes several orders of magnitude less time to compute than HMC. Computation time was measured on a MacBook Air M2 with 24GB of RAM.
  • Figure 2: Results based on MathAchieve. Remaining results are depicted in the results appendix. (A) Parameter Recovery. Our model outperforms HMC on average in terms of r, bias and RMSE for all parameter types, and has fewer outliers. (B) Coverage. Our model's posterior credible intervals are on average more faithfully tuned.
  • Figure 3: Results based on MathAchieve. (A) Credible Intervals. $95\%$ and $50\%$ credible intervals for metabeta and HMC, compared over different parameter values. Note that discrepancies in width are mirrored in the coverage plot of Figure 2B: HMC has poorer coverage for $\beta_0$ and $\sigma_0$ and its credible intervals are on average wider than metabeta's for both parameters. The plot for $\beta_4$ is omitted due to space constraints. (B) Posterior Predictive. Observed regression outputs (black) plotted against samples from the posterior predictive (colored) and its mean (grey) for both models. Curves based on kernel density estimates over data points, separately for two randomly chosen datasets.
  • Figure 4: Scatter plots of sampled synthetic predictors for two datasets.
  • Figure 5: Results based on the toy example.
  • ...and 3 more figures
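
The coverage comparisons in the captions above rest on a simple quantity: the fraction of datasets whose true parameter falls inside a credible interval of a given nominal level. The sketch below shows one common way to compute it from posterior samples. The paper's exact coverage metric is not stated in this section, so the central-interval definition, the function name `empirical_coverage` and the conjugate-normal toy check are assumptions made for illustration.

```python
import numpy as np

def empirical_coverage(samples, truth, level=0.95):
    """Share of datasets whose true value lies in the central credible interval.

    samples: array (n_datasets, k) of posterior draws for one scalar parameter
    truth:   array (n_datasets,) of ground-truth parameter values
    """
    tail = (1.0 - level) / 2.0
    lo = np.quantile(samples, tail, axis=1)
    hi = np.quantile(samples, 1.0 - tail, axis=1)
    return np.mean((truth >= lo) & (truth <= hi))

# Toy check with a conjugate model whose exact posterior is known:
# prior theta ~ N(0, 1), one observation y ~ N(theta, 1),
# hence posterior theta | y ~ N(y / 2, 1 / 2).
rng = np.random.default_rng(0)
n_datasets, k = 2000, 1000
truth = rng.normal(0.0, 1.0, size=n_datasets)
obs = truth + rng.normal(0.0, 1.0, size=n_datasets)
samples = rng.normal(obs[:, None] / 2.0, np.sqrt(0.5), size=(n_datasets, k))

cov95 = empirical_coverage(samples, truth, level=0.95)
cov50 = empirical_coverage(samples, truth, level=0.50)
print(f"95% interval coverage: {cov95:.3f}, 50% interval coverage: {cov50:.3f}")
```

For a well-calibrated posterior, the empirical coverage should be close to the nominal level; values well below it indicate overconfident intervals, and values well above it indicate intervals that are wider than necessary.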