Gradient Boosted Mixed Models: Flexible Joint Estimation of Mean and Variance Components for Clustered Data

Mitchell L. Prevett, Francis K. C. Hui, Zhi Yang Tho, A. H. Welsh, Anton H. Westveld

TL;DR

GBMixed addresses the need to model clustered data with a flexible mean structure and uncertainty-aware variance components. It extends gradient boosting to jointly estimate the mean function and two variance components (random-effects and residual) through likelihood-based gradients, allowing the random-effects covariance G and the residual variance R to depend on covariates while providing calibrated prediction intervals. The framework encompasses several variants (GBMixed-Base, RBoost, GBoost, GRBoost). Across simulations and real data (PBC, PSID), it yields more accurate conditional average treatment effect (CATE) estimates and better variance recovery than LM, LMER, RF, XGBoost, and CF. By enabling covariate-dependent shrinkage and heteroscedastic uncertainty quantification, GBMixed supports more reliable predictive and causal inferences in complex hierarchical settings, with potential applications in precision medicine, economics, and policy analysis.

Abstract

Linear mixed models are widely used for clustered data, but their reliance on parametric forms limits flexibility in complex and high-dimensional settings. In contrast, gradient boosting methods achieve high predictive accuracy through nonparametric estimation, but do not accommodate clustered data structures or provide uncertainty quantification. We introduce Gradient Boosted Mixed Models (GBMixed), a framework and algorithm that extends boosting to jointly estimate mean and variance components via likelihood-based gradients. In addition to nonparametric mean estimation, the method models both random effects and residual variances as potentially covariate-dependent functions using flexible base learners such as regression trees or splines, enabling nonparametric estimation while maintaining interpretability. Simulations and real-world applications demonstrate accurate recovery of variance components, calibrated prediction intervals, and improved predictive accuracy relative to standard linear mixed models and nonparametric methods. GBMixed provides heteroscedastic uncertainty quantification and introduces boosting for heterogeneous random effects. This enables covariate-dependent shrinkage for cluster-specific predictions to adapt between population and cluster-level data. Under standard causal assumptions, the framework enables estimation of heterogeneous treatment effects with reliable uncertainty quantification.
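
To make the idea of covariate-dependent shrinkage concrete, here is a minimal sketch using the standard mixed-model predictor of the random effects, $\hat{\boldsymbol{b}}_i = \boldsymbol{G}_i \boldsymbol{Z}_i^\top \boldsymbol{\Sigma}_i^{-1}(\boldsymbol{y}_i - \boldsymbol{\mu}_i)$. It is not the authors' implementation: the function names `predict_random_effects` and `intercept_shrinkage` and the toy numbers are assumptions chosen for illustration, and the variance values are supplied by hand rather than learned by boosting.

```python
import numpy as np

def predict_random_effects(resid, Z, G, R):
    """Standard mixed-model predictor b_hat = G Z^T Sigma^{-1} (y - mu) for one cluster."""
    Sigma = Z @ G @ Z.T + R  # marginal covariance of the cluster
    return G @ Z.T @ np.linalg.solve(Sigma, resid)

def intercept_shrinkage(g, r, n):
    """Shrinkage weight for a random intercept with constant residual variance r."""
    return g / (g + r / n)

# Hypothetical cluster: 4 observations, each one unit above the fitted mean.
resid = np.ones(4)
Z = np.ones((4, 1))  # random-intercept design

# A larger random-effect variance g pulls the prediction toward the cluster's own data;
# a smaller g shrinks it toward the population mean.
for g in (0.1, 2.0):
    b_hat = predict_random_effects(resid, Z, np.array([[g]]), np.eye(4))
    print(f"g={g}: weight={intercept_shrinkage(g, 1.0, 4):.3f}, b_hat={b_hat[0]:.3f}")
```

If the random-effect variance is allowed to vary with cluster covariates, as the abstract describes, each cluster receives its own weight, which is how predictions can adapt between population-level and cluster-level information.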

Paper Structure

This paper contains 45 sections, 93 equations, 13 figures, 11 tables, 1 algorithm.

Figures (13)

  • Figure 1: Flowchart of the GBMixed algorithm. At each iteration, the mean function $f(\cdot)$, random effects covariance $\boldsymbol{G}_i$, and residual variances $\boldsymbol{R}_i$ are updated using gradients as pseudo-responses. Here, $\hat{\boldsymbol{\mu}}$ denotes the fitted means implied by the learned function $f(\cdot)$, and $\hat{\boldsymbol{\Sigma}}_i = \boldsymbol{Z}_i \hat{\boldsymbol{G}}_i \boldsymbol{Z}_i^\top + \hat{\boldsymbol{R}}_i$ is the estimated marginal covariance matrix for group $i$. Full definitions of the notation are provided in Section \ref{sec:algorithm}.
  • Figure 2: Experiment A: test-set CATE predictions vs. ground truth for a representative replication.
  • Figure 3: Partial dependence of residual variance on $x_2$ (left: V-shape) and $x_5$ (right: step) for a representative replication.
  • Figure 4: Variable importance for residual variance (left, RBoost) and random-effect variance (right, GBoost) for a representative replication. The dominant drivers align with the data-generating variance patterns.
  • Figure 5: Variable importance from GBMixed with MARS base learners. Alkaline phosphatase, bilirubin, prothrombin time, cholesterol, and albumin rank as the top predictors, motivating closer inspection through partial dependence plots.
  • ...and 8 more figures
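
The Figure 1 caption describes gradients of the likelihood serving as pseudo-responses for the mean and variance updates. The sketch below shows what such gradients look like for one cluster under a Gaussian marginal likelihood with $\boldsymbol{\Sigma}_i = \boldsymbol{Z}_i \boldsymbol{G}_i \boldsymbol{Z}_i^\top + \boldsymbol{R}_i$. It is an illustration under stated assumptions, not the paper's algorithm: the function name `cluster_gradients` is hypothetical, gradients are taken directly with respect to the entries of $\boldsymbol{\mu}_i$, $\boldsymbol{G}_i$, and the diagonal of $\boldsymbol{R}_i$ (the paper may use a different parameterization, such as log-variances), and the entries of $\boldsymbol{G}_i$ are treated as unconstrained.

```python
import numpy as np

def cluster_gradients(y, mu, Z, G, R):
    """Gaussian log-likelihood of one cluster and its gradients (illustrative only)."""
    resid = y - mu
    Sigma = Z @ G @ Z.T + R                      # marginal covariance of the cluster
    Sigma_inv = np.linalg.inv(Sigma)
    alpha = Sigma_inv @ resid                    # Sigma^{-1} (y - mu)
    _, logdet = np.linalg.slogdet(Sigma)
    loglik = -0.5 * (logdet + resid @ alpha + len(y) * np.log(2 * np.pi))

    grad_mu = alpha                              # d loglik / d mu
    grad_Sigma = 0.5 * (np.outer(alpha, alpha) - Sigma_inv)  # d loglik / d Sigma
    grad_R_diag = np.diag(grad_Sigma)            # d loglik / d R_jj (diagonal R)
    grad_G = Z.T @ grad_Sigma @ Z                # d loglik / d G via the chain rule
    return loglik, grad_mu, grad_R_diag, grad_G

# Hypothetical random-intercept cluster with three observations.
y = np.array([1.0, 2.0, 0.5])
mu = np.array([0.5, 1.5, 1.0])
Z = np.ones((3, 1))
G = np.array([[2.0]])
R = np.diag([1.0, 0.5, 2.0])

loglik, grad_mu, grad_R_diag, grad_G = cluster_gradients(y, mu, Z, G, R)
print(loglik, grad_mu, grad_R_diag, grad_G)
```

In a boosting loop, quantities like `grad_mu`, `grad_R_diag`, and `grad_G` would be the targets that base learners (trees or splines) are fit to at each iteration, with a small step taken in the fitted direction.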