Cross-validation Approaches for Multi-study Predictions

Boyu Ren, Prasad Patil, Francesca Dominici, Giovanni Parmigiani, Lorenzo Trippa

Abstract

We consider prediction in multiple studies with potential differences in the relationships between predictors and outcomes. Our objective is to integrate data from multiple studies to develop prediction models for unseen studies. We propose and investigate two cross-validation approaches applicable to multi-study stacking, an ensemble method that linearly combines study-specific ensemble members to produce generalizable predictions. Unlike earlier multi-study stacking, some of our cross-validation approaches avoid reusing the same data in both the training and stacking steps. We prove that under mild regularity conditions the proposed cross-validation approaches produce stacked prediction functions with oracle properties. We also identify analytically in which scenarios the proposed cross-validation approaches increase prediction accuracy compared to stacking with data reuse. We perform a simulation study to illustrate these results. Finally, we apply our method to predicting mortality from long-term exposure to air pollutants, using collections of datasets.
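
The contrast between stacking with data reuse and a cross-validated alternative can be made concrete with a small numerical sketch. The code below is an illustration under stated assumptions, not the paper's exact procedure: the simulation settings (`K`, `n`, `p`, `beta0`, `sigma_beta`), the helper names, and the use of unconstrained least-squares stacking weights are all hypothetical choices. In the cross-study variant, the study-specific prediction function (SPF) trained on study k is simply left out when predicting outcomes in study k, which is one possible way to avoid using the same data for training and stacking.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation settings (assumptions, not values from the paper).
K, n, p = 5, 100, 3        # number of studies, samples per study, predictors
beta0 = np.ones(p)         # mean of the study-specific coefficients
sigma_beta = 0.5           # between-study heterogeneity

def simulate_study():
    """Draw one study whose coefficients vary around beta0."""
    beta_k = beta0 + sigma_beta * rng.normal(size=p)
    X = rng.normal(size=(n, p))
    y = X @ beta_k + rng.normal(size=n)
    return X, y

def fit_ols(X, y):
    """Ordinary least-squares coefficients."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

studies = [simulate_study() for _ in range(K)]

# One OLS study-specific prediction function (SPF) per study.
spf_coefs = [fit_ols(X, y) for X, y in studies]

# Rows: all samples from all studies; column k: predictions of SPF k.
P_dr = np.vstack([np.column_stack([X @ c for c in spf_coefs])
                  for X, _ in studies])
y_all = np.concatenate([y for _, y in studies])

# Stacking with data reuse: each SPF also predicts its own training data.
w_dr, *_ = np.linalg.lstsq(P_dr, y_all, rcond=None)

# Cross-study variant (assumed form): on the rows of study k, drop the
# contribution of the SPF that was trained on study k.
P_cs = P_dr.copy()
for k in range(K):
    P_cs[k * n:(k + 1) * n, k] = 0.0
w_cs, *_ = np.linalg.lstsq(P_cs, y_all, rcond=None)

# Evaluate both stacked prediction functions on a new, unseen study.
X_new, y_new = simulate_study()
P_new = np.column_stack([X_new @ c for c in spf_coefs])
mse_dr = np.mean((y_new - P_new @ w_dr) ** 2)
mse_cs = np.mean((y_new - P_new @ w_cs) ** 2)
print(f"data reuse MSE: {mse_dr:.3f}, cross-study MSE: {mse_cs:.3f}")
```

Which set of weights predicts better on the new study depends on the settings; the sketch only shows where the two procedures differ, namely in which predictions enter the stacking regression.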

Paper Structure

This paper contains 18 sections, 7 theorems, 97 equations, 6 figures.

Key Result

Proposition 1

Assume $\mathcal{D}$ is partitioned by study and $n_k=n$ for $k\in\{1,2,\ldots, K\}$. Fix $\beta_1,\ldots,\beta_K$ in model (eq:diff-sim). Let $L=1$ and let the SPFs $\hat{Y}^\ell_k$ be OLS regression functions. For any $w\in W$, where $W$ is a bounded set in $\mathbb R^{K}$, the following results hold: Here $Z$ is a non-degenerate normally distributed random variable.

Figures (6)

  • Figure 1: Illustration of the relation between studies and training sets for $\mathcal{D}=\{D_1,D_2,D_3\}$ when all studies are of equal size ($n_k=n$, $k = 1,\ldots,K$).
  • Figure 2: (a) Comparisons of DR and $\text{CV}_{\text{ws}}$ in Example \ref{ex:two-y}. We illustrate $|\hat{U}^{\text{DR}}(w)-\hat{U}^{\text{WS}}(w)|$ (black) and $|\hat{U}^{\text{DR}}(w) - \lim_{n\to\infty}\hat{U}^{\text{DR}}(w)|$ (red) at $w = \mathbf{1}_K/K$ as a function of $n$. The dashed lines indicate the upper and lower fifth percentiles of the differences across simulation replicates. The solid lines illustrate the linear approximation of the log-transformed average differences (black and red dots). The slopes approximate the rates of convergence of the differences as $n\to\infty$. (b-c) Comparisons of bias and standard deviation of the utility estimates from DR, $\text{CV}_{\text{ws}}$, and $\text{CV}_{\text{cs}}$ stacking in Example \ref{ex:K-regression}. Note that in (c) the curves for DR and $\text{CV}_{\text{ws}}$ stacking overlap. (d-f) Comparisons of $\text{CV}_{\text{ws}}$ and $\text{CV}_{\text{cs}}$ in generalist predictions (Example \ref{ex:three-study}). We visualize the contour plots of $\hat{U}^{\text{WS}}(w)$ (d) and $\hat{U}^{\text{CS}}(w)$ (e). We use a dashed line and a dot to illustrate the maximizers of $\hat{U}^{\text{WS}}(w)$ and $\hat{U}^{\text{CS}}(w)$, respectively. (f) PCA of different PFs. We include three SPFs $\hat{Y}_k$, two stacked PFs $\hat{Y}^{\text{WS}}$ and $\hat{Y}^{\text{CS}}$, and the oracle PF $Y_g$.
  • Figure 3: Comparison of DR stacking and $\text{CV}_{\text{cs}}$ when $K=2$ (left) and $K=9$ (right). The plots illustrate the differences $\mathbb E(\psi(\hat{w}^{\text{DR}}) - \psi(\hat{w}^{\text{CS}}))$ (CVCS) and $\mathbb E(\psi(\hat{w}^{\text{DR}}) - \psi(w_g^0))$ (oracle). We set $p=10$, $\beta_0 = \mathbf{1}_K$, $n = 200$, and vary $\sigma_\beta$. The expected values are calculated with 1,000 replicates.
  • Figure 4: (a-b) $\mathbb E(\psi(\hat{w}^{\text{DR}})-\psi(w_g^0))$ as a function of $K$ and $n$. Dots represent results from the Monte Carlo simulation. Lines illustrate the fitted functions $c_0 + c_1\log(K)/K$ (a) and $c_0 + c_1/\sqrt{n}$ (b) to the Monte Carlo results. (c) Comparison of generalist prediction accuracy of DR stacking and $\text{CV}_{\text{cs}}$ as measured by $\mathbb E(\psi(\hat{w}^{\text{DR}}) - \psi(\hat{w}^{\text{CS}}))$.
  • Figure 5: Comparison of DR stacking and $\text{CV}_{\text{cs}}$ in generalist predictions of mortality. Boxplots show the distributions of the differences in accuracy of the two stacking methods across 20 replicates, evaluated as average RMSE across all validation regions and years. In each replicate, we randomly select 10 regions to train, and evaluate the stacked PF on the remaining regions (39 test states in the U.S. and 47 test counties in California).
  • ...and 1 more figure

Theorems & Definitions (18)

  • Example 1: No predictors, DR
  • Example 1: No predictors, $\text{CV}_{\text{ws}}$
  • Example 2: Regression, $\text{CV}_{\text{ws}}$ vs. DR
  • Proposition 1
  • Example 1: No predictors, $\text{CV}_{\text{cs}}$
  • Example 2: Regression, $\text{CV}_{\text{cs}}$
  • Example 3
  • Proposition 2
  • Example 2: DR and $\text{CV}_{\text{cs}}$
  • Proposition 3
  • ...and 8 more