Merging versus Ensembling in Multi-Study Prediction: Theoretical Insight from Random Effects

Zoe Guan, Giovanni Parmigiani, Prasad Patil

TL;DR

This work addresses how to best leverage multiple studies for prediction under cross-study heterogeneity. It develops a theoretical framework using a flexible mixed-effects data-generating model and analyzes ridge (and special-case least-squares) predictions to derive a transition point for when merging studies outperforms multi-study ensembling, and vice versa. The authors provide analytic expressions for the transition in equal-variance settings and bounds for unequal-variance scenarios, complemented by simulations and a metagenomics application to illustrate practical decisions. The results offer a principled guide for deciding whether to pool data or ensemble study-specific models, with direct relevance to fields like metagenomics where cross-study heterogeneity is common.

Abstract

A critical decision point when training predictors using multiple studies is whether studies should be combined or treated separately. We compare two multi-study prediction approaches in the presence of potential heterogeneity in predictor-outcome relationships across datasets: 1) merging all of the datasets and training a single learner, and 2) multi-study ensembling, which involves training a separate learner on each dataset and combining the predictions resulting from each learner. For ridge regression, we show analytically and confirm via simulation that merging yields lower prediction error than ensembling when the predictor-outcome relationships are relatively homogeneous across studies. However, as cross-study heterogeneity increases, there exists a transition point beyond which ensembling outperforms merging. We provide analytic expressions for the transition point in various scenarios, study asymptotic properties, and illustrate how transition point theory can be used for deciding when studies should be combined with an application from metagenomics.
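
The comparison described in the abstract can be explored numerically. The following is a minimal sketch, not the authors' simulation code: it assumes a linear model in which the first $Q$ of $P$ coefficients receive study-specific Gaussian random effects with common variance `sigma2`, standard normal predictors, unit noise variance, a fixed ridge penalty `lam`, and an equal-weight average of study-specific ridge fits as the ensemble. The function names, the coefficient vector `beta`, and all default values are hypothetical choices made for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept, for simplicity)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def simulate_once(rng, sigma2, K=5, n_k=50, P=10, Q=5, lam=1.0, n_test=500):
    """Return (MSPE_merged, MSPE_ensemble) for one simulated replicate."""
    beta = np.ones(P)  # hypothetical fixed effects shared by all studies

    def draw_study(n):
        # Assumption: the first Q coefficients get a study-specific random effect.
        X = rng.normal(size=(n, P))
        coef = beta.copy()
        coef[:Q] += rng.normal(scale=np.sqrt(sigma2), size=Q)
        y = X @ coef + rng.normal(size=n)
        return X, y

    studies = [draw_study(n_k) for _ in range(K)]
    X0, y0 = draw_study(n_test)  # held-out study with its own random effect

    # Merging: stack all training studies and fit a single ridge learner.
    X_all = np.vstack([X for X, _ in studies])
    y_all = np.concatenate([y for _, y in studies])
    beta_merge = ridge_fit(X_all, y_all, lam)

    # Ensembling: fit one ridge learner per study, then average with equal weights.
    beta_ens = np.mean([ridge_fit(X, y, lam) for X, y in studies], axis=0)

    def mspe(b):
        return float(np.mean((y0 - X0 @ b) ** 2))

    return mspe(beta_merge), mspe(beta_ens)


def mean_log_ratio(sigma2, reps=200, seed=0):
    """log(mean MSPE of ensembling / mean MSPE of merging) over replicates."""
    rng = np.random.default_rng(seed)
    results = np.array([simulate_once(rng, sigma2) for _ in range(reps)])
    merged, ens = results.mean(axis=0)
    return float(np.log(ens / merged))
```

Under these assumptions, a positive value of `mean_log_ratio` favors merging and a negative value favors ensembling, so evaluating it over a grid of `sigma2` values gives a rough picture of where the two approaches trade places.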

Paper Structure

This paper contains 24 sections, 7 theorems, 45 equations, 12 figures, 1 table.

Key Result

Theorem 1

Suppose $\sigma_{j}^2 = \sigma^2$ for $j=1,\dots,Q$, and let $\tau$ denote the transition point (its closed-form definition is given in the paper). Then $E[\| \bm{Y_0} - \bm{\tilde{X}_0} \bm{\hat{\beta}_{ens}} \|_2^2 | \bm{{X}_0}] \leq E[\| \bm{Y_0} - \bm{\tilde{X}_0} \bm{\hat{\beta}_{merge}} \|_2^2 | \bm{{X}_0}]$ if and only if $\overline{\sigma^2} \geq {\tau}$.

Figures (12)

  • Figure 1: Relative performance of multi-study ensembling and merging as a function of heterogeneity in the main simulation scenario where $K=5$, $N_k=50$, $P=10$, $Q=5$, and the random effects have equal variances. MSPE: mean squared prediction error. The vertical dashed lines correspond to the theoretical transition points calculated using Theorem 1. The empirical transition point occurs at the value of $\overline{\sigma^2}$ where the log ratio of the prediction errors for ensembling and merging is 0.
  • Figure 2: Performance comparisons for three values of $\overline{\sigma^2}$ in the main simulation scenario where $K=5$, $N_k=50$, $P=10$, $Q=5$, and the random effects have equal variances. MSPE: mean squared prediction error; LME: linear mixed effects model; LS,M: merged least squares learner; LS,E: ensemble learner based on least squares; R,M: merged ridge regression learner; R,E: ensemble learner based on ridge regression; L,M: merged lasso learner; L,E: ensemble learner based on lasso; NN,M: merged neural network; NN,E: ensemble learner based on neural networks; RF,M: merged random forest; RF,E: ensemble learner based on random forests.
  • Figure 3: Root mean square prediction error (RMSPE) for the first data illustration scenario with bootstrap confidence intervals. LS: least squares. NN: neural network. RF: random forest.
  • Figure 4: Root mean square prediction error (RMSPE) for the second data illustration scenario with bootstrap confidence intervals. LS: least squares. NN: neural network. RF: random forest.
  • Figure 5: Relative performance of multi-study ensembling and merging as a function of heterogeneity in the main simulation scenario for misspecified linear models fit using the original predictors without basis expansion. MSPE: mean squared prediction error. The vertical dashed lines correspond to the transition points calculated using Theorem 1.
  • ...and 7 more figures
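
The Figure 1 caption defines the empirical transition point as the value of $\overline{\sigma^2}$ at which the log ratio of the prediction errors for ensembling and merging equals 0. One way to locate it from simulated results is sketched below. This is an illustrative helper, not code from the paper: the function name and the use of linear interpolation between grid points are assumptions.

```python
import numpy as np


def empirical_transition_point(sigma2_grid, log_ratio):
    """Estimate where log(MSPE_ens / MSPE_merge) first crosses zero.

    Scans the grid for the first pair of neighbouring points where the log
    ratio goes from positive (merging better) to non-positive (ensembling at
    least as good) and linearly interpolates between them. Returns None if
    no such crossing occurs on the grid.
    """
    s = np.asarray(sigma2_grid, dtype=float)
    r = np.asarray(log_ratio, dtype=float)
    for i in range(len(s) - 1):
        if r[i] > 0 >= r[i + 1]:
            # Fraction of the interval at which the straight line hits zero.
            t = r[i] / (r[i] - r[i + 1])
            return float(s[i] + t * (s[i + 1] - s[i]))
    return None
```

With a noisy simulated curve, the interpolated crossing depends on the grid spacing and the number of replicates, so it is best read as an approximation to compare against the theoretical value from Theorem 1.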

Theorems & Definitions (7)

  • Theorem 1
  • Theorem 2
  • Proposition 3
  • Corollary 1
  • Corollary 2
  • Corollary 3
  • Corollary 4