Two-phase validation sampling via principal components to improve efficiency in multi-model estimation from error-prone biomedical databases

Sarah C. Lotspeich, Cole Manschot

TL;DR

This paper tackles measurement error in covariates by proposing a two-phase validation design that extends extreme-tail sampling to multiple modeling objectives. By summarizing cross-model exposure variability with the first principal component and selecting validation subjects with extreme PC1* values, the approach allocates validation resources efficiently across several outcomes. Simulations and an NHANES application show that ETS-PC1* reduces the total variability of exposure coefficients across models, often outperforming standard SRS and single-model ETS designs, particularly when error-prone exposures are correlated or measurement error is substantial. The method is practical, scalable, and accompanied by an R package, facilitating broader adoption in large biomedical datasets with multiple error-prone covariates.

Abstract

Two-phase sampling offers a cost-effective way to validate error-prone covariate measurements in biomedical databases. Inexpensive or easy-to-obtain information is collected for the entire study in Phase I. Then, a subset of patients undergoes cost-intensive validation (e.g., expert chart review) to collect more accurate data in Phase II. When balancing primary and secondary analyses, competing models and priorities can result in poorly defined objectives for the most informative Phase II sampling criterion. Extreme tail sampling (ETS), wherein patients with the smallest and largest values of a particular quantity (like a covariate or residual) are selected, can offer great statistical efficiency in two-phase studies when focusing on a single analytic objective by targeting observations with the biggest contributions to the Fisher information. We propose an intuitive, easy-to-use approach that extends ETS to balance and prioritize explaining the largest amount of variability across multiple models of interest. Using principal components, we succinctly summarize the inherent variability of all models' error-prone exposures. Then, we sample patients with the most extreme principal components for validation. Through simulations and an application to the National Health and Nutrition Examination Survey (NHANES), the proposed strategy offered simultaneous efficiency gains across multiple models of interest. Its advantages persisted across various real-world scenarios. When designing a validation study, concentrating on a single model may be short-sighted. Strategically allocating resources more broadly balances multiple analytical goals simultaneously. Employing dimension reduction before sampling will allow this strategy to scale up well to big-data applications with many error-prone covariates.
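The design described in the abstract can be sketched in a few lines: summarize the error-prone exposures with their first principal component, then validate the patients with the smallest and largest scores. This is a minimal illustration, not the authors' R package. The helper names (`first_principal_component`, `extreme_tail_sample`), the simulated data, the equicorrelated covariance, the choice to standardize columns before PCA, and the even split between the two tails are all assumptions made for the example.

```python
import numpy as np

def first_principal_component(x_star):
    """Scores on the first principal component of the standardized columns."""
    # Assumption: center and scale each error-prone exposure before PCA.
    z = (x_star - x_star.mean(axis=0)) / x_star.std(axis=0, ddof=1)
    # Rows of vt are the loading vectors, ordered by explained variance.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[0]

def extreme_tail_sample(score, n_validate):
    """Indices of the smallest and largest scores, split evenly across tails."""
    order = np.argsort(score)
    lower = n_validate // 2
    upper = n_validate - lower
    return np.concatenate([order[:lower], order[len(order) - upper:]])

rng = np.random.default_rng(0)
N, J = 1000, 5  # Phase I size and number of models (one exposure each)

# Hypothetical true exposures X with pairwise correlation 0.5.
cov = np.full((J, J), 0.5) + 0.5 * np.eye(J)
x = rng.multivariate_normal(np.zeros(J), cov, size=N)

# Error-prone versions X* = X + U with additive error variance 0.5.
x_star = x + rng.normal(scale=np.sqrt(0.5), size=(N, J))

# Phase II: validate the 100 patients with the most extreme PC1* scores.
pc1_star = first_principal_component(x_star)
validated = extreme_tail_sample(pc1_star, n_validate=100)
```

The sign of a principal component is arbitrary, but because both tails are sampled, flipping the sign selects the same patients whenever the number validated is even.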

Paper Structure

This paper contains 30 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Plot of the primary model's true exposure $X_p$ against the error-prone version $X_p^* = X_p + U$ under varied error severities, driven by the variance of the errors $U$. Each plot contains $N = 1000$ simulated observations, and the points are colored by their validation status based on extreme tail sampling on $X_p^*$ (ETS-$X_p^*$) versus extreme tail sampling on $X_p$ (ETS-$X_p$). "Counts" denotes the number of observations in each setting with the corresponding validation status. As the error variance increases, and $X_p^*$ becomes less informative about $X_p$, the number of observations that would be sampled by both ETS designs decreases, and the efficiency of ETS-$X_p^*$ is expected to drop below that of ETS-$X_p$. Still, even under the worst error setting, the ETS-$X_p^*$ design should capture more information than simple random sampling (SRS).
  • Figure 2: Plot of all models' true exposure $X_j$ against the first principal component $PC_1^*$ summarizing all the error-prone exposures $\pmb{X}^*$, focusing on the moderate error setting (error variance $= 0.5$). Each plot contains the same $N = 1000$ simulated observations, and the points are colored by their validation status based on extreme tail sampling on the first principal component $PC_1^*$ (ETS-$PC_1^*$) versus extreme tail sampling on the specific model's exposure $X_j$ (ETS-$X_j$). "Counts" denotes the number of observations in each setting with the corresponding validation status. Across all five models, the number of overlapping observations that would be sampled by both ETS designs is relatively stable, so ETS-$PC_1^*$ is expected to approximate the efficiency of ETS-$X_j$ similarly well for each model separately.
  • Figure 3: Simulation results comparing the empirical total coefficient variability across all models $\sum_{j=1}^{5}\widehat{\textrm{V}}(\hat{\beta}_{1j})$ under simple random sampling (SRS), extreme tail sampling on $X_1^*$ (ETS-$X_1^*$), and extreme tail sampling on the first principal component (ETS-$PC_1^*$) validation study designs. In A), three different covariance structures for the five exposures $X_1, \dots, X_5$ were considered. In B), three different variances $\sigma_U^2$ for the additive measurement errors $U_1, \dots, U_5$ in exposures $X_1, \dots, X_5$ were considered. In C), three different proportions of validated patients out of $N = 1000$ were considered.
  • Figure 4: Simulation results comparing the empirical efficiency under simple random sampling (SRS), extreme tail sampling on $X_1^*$ (ETS-$X_1^*$), and extreme tail sampling on the first principal component (ETS-$PC_1^*$) validation study designs. In A), three different covariance structures for the five covariates $X_1, \dots, X_5$ were considered. In B), three different variances $\sigma_U^2$ for the additive measurement errors $U_1, \dots, U_5$ in exposures $X_1, \dots, X_5$ were considered. In C), three different proportions of validated patients out of $N = 1000$ were considered.
  • Figure 5: Simulation results comparing the empirical total coefficient variability across all models (A) and empirical efficiency for the exposure coefficient per model (B) under simple random sampling (SRS), extreme tail sampling on $X_1^*$ (ETS-$X_1^*$), and extreme tail sampling on the first principal component (ETS-$PC_1^*$) validation study designs. There was a shared outcome $Y$, generated from the exposures $X_1, \dots, X_5$ under scenarios where only one exposure was associated with it (only $\beta_1 \neq 0$ for $X_1$ or only $\beta_2 \neq 0$ for $X_2$) versus where all exposures were associated with it (all $\beta_j \neq 0$ for $X_1,\dots,X_5$).
  • ...and 1 more figure
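The "Counts" comparison in Figure 1 can be reproduced in spirit with a short simulation: draw a true exposure $X_p$, add error to obtain $X_p^* = X_p + U$, and count how many patients would be selected by both ETS-$X_p^*$ and ETS-$X_p$. The sketch below is illustrative only. The function names, the standard normal exposure, the validation size of 100, and the error variances other than the moderate value of 0.5 are assumptions, not values taken from the paper.

```python
import numpy as np

def ets_indices(values, n_validate):
    """Indices of the smallest and largest values, split evenly across tails."""
    order = np.argsort(values)
    lower = n_validate // 2
    upper = n_validate - lower
    return np.concatenate([order[:lower], order[len(order) - upper:]])

def overlap_count(x, error_var, n_validate, rng):
    """Number of patients chosen by both ETS on X and ETS on X* = X + U."""
    u = rng.normal(scale=np.sqrt(error_var), size=x.shape[0])
    chosen_true = ets_indices(x, n_validate)
    chosen_error_prone = ets_indices(x + u, n_validate)
    return np.intersect1d(chosen_true, chosen_error_prone).size

rng = np.random.default_rng(1)
N, n_validate = 1000, 100
x = rng.normal(size=N)  # hypothetical true exposure

# Hypothetical error severities; 0.5 matches the "moderate" setting.
for error_var in (0.1, 0.5, 1.0):
    shared = overlap_count(x, error_var, n_validate, rng)
    print(f"error variance {error_var}: {shared} of {n_validate} shared")
```

With no measurement error the two designs coincide, so the overlap equals the validation size. Larger error variances are expected to reduce the overlap, which is the pattern the caption describes.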