Optimal two-phase sampling designs for generalized raking estimators with multiple parameters of interest

Jasper B. Yang, Bryan E. Shepherd, Thomas Lumley, Pamela A. Shaw

Abstract

Large observational datasets, including those derived from electronic health records, are a valuable resource for medical research but are often affected by missingness, measurement error, and misclassification. Two-phase sampling with generalized raking (GR) estimation is an efficient and robust approach to statistical inference in such settings. In this approach, variables that are unavailable or measured with error in a large phase 1 cohort are obtained with higher-quality measurements in a phase 2 subsample. Previous research has studied optimal phase 2 sampling designs for inverse probability weighted (IPW) estimators in non-adaptive, multi-parameter settings, and for GR estimators in single-parameter settings. In this work, we extend these results by deriving optimal adaptive, multiwave sampling designs for IPW and GR estimators when multiple parameters are of interest. We propose several practical allocation strategies and evaluate their performance through extensive simulations and a data example from the Vanderbilt Comprehensive Care Clinic HIV Study. Our results show that independently optimizing allocation for each parameter improves efficiency over traditional case-control sampling. We also derive an integer-valued, A-optimal allocation method that typically outperforms independent optimization. Notably, we find that optimal designs for GR can differ substantially from those for IPW, and that this distinction can meaningfully affect estimator efficiency in the multiple-parameter setting. These findings offer practical guidance for future two-phase studies involving incomplete or error-prone data.

Paper Structure

This paper contains 26 sections, 16 equations, 2 figures, and 17 tables.

Figures (2)

  • Figure 1: Visual depiction of simultaneous and sequential approaches when two outcomes are of equal interest.
  • Figure S1: Optimal stratum fractions for three strata defined on ADE or Death cases under A-optimal allocation with respect to GR (GR), A-optimal allocation with respect to IPW (IPW), univariate IPW optimal allocation with respect to $\beta_{\text{ADE}}$ (IPW-ADE), and univariate IPW optimal allocation with respect to $\beta_{\text{death}}$ (IPW-Death).