M-estimation under Two-Phase Multiwave Sampling with Applications to Prediction-Powered Inference

Dan M. Kluger, Stephen Bates

TL;DR

This work focuses on the case where proxies for the expensive variables -- such as predictions from pretrained machine learning models -- are available for all units and proposes a Multiwave Predict-Then-Debias estimator that combines proxy information with the expensive, higher-quality measurements to improve efficiency while removing bias.

Abstract

In two-phase multiwave sampling, inexpensive measurements are collected on a large sample and expensive, more informative measurements are adaptively obtained on subsets of units across multiple waves. Adaptively collecting the expensive measurements can increase efficiency but complicates statistical inference. We give valid estimators and confidence intervals for M-estimation under adaptive two-phase multiwave sampling. We focus on the case where proxies for the expensive variables -- such as predictions from pretrained machine learning models -- are available for all units and propose a Multiwave Predict-Then-Debias estimator that combines proxy information with the expensive, higher-quality measurements to improve efficiency while removing bias. We establish asymptotic linearity and normality and propose asymptotically valid confidence intervals. We also develop an approximately greedy sampling strategy that improves efficiency relative to uniform sampling. Data-based simulation studies support the theoretical results and demonstrate efficiency gains.
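To make the idea of combining proxies with expensive measurements concrete, the following is a minimal sketch for the simplest possible target, a population mean, under a single wave of sampling with known labelling probabilities. It is not the paper's Multiwave Predict-Then-Debias estimator: the function names (`predict_then_debias_mean`, `simulate`), the scalar `tuning` weight (the paper refers to a tuning matrix), and the simulated data are all assumptions introduced here for illustration. The sketch only shows why an inverse-probability-weighted correction on the labelled units can remove the bias of a proxy-only estimate.

```python
import random


def predict_then_debias_mean(proxy, truth, labelled, probs, tuning=1.0):
    """Illustrative debiased mean estimate (hypothetical helper, not the paper's code).

    proxy    : proxy values, available for every unit
    truth    : expensive values, only read where labelled[i] is True
    labelled : whether unit i's expensive value was collected
    probs    : probability with which unit i was selected for labelling
    tuning   : scalar weight on the proxy term (0.0 ignores the proxies)
    """
    n = len(proxy)
    proxy_mean = sum(proxy) / n
    correction = 0.0
    for i in range(n):
        if labelled[i]:
            # Inverse-probability weighting: in expectation this sum equals
            # the sum of (truth - tuning * proxy) over all units.
            correction += (truth[i] - tuning * proxy[i]) / probs[i]
    return tuning * proxy_mean + correction / n


def simulate(n=20000, seed=0):
    """Simulated data (an assumption): a biased proxy and proxy-dependent labelling."""
    rng = random.Random(seed)
    truth = [rng.gauss(2.0, 1.0) for _ in range(n)]
    # The proxy is systematically 0.5 too large, plus noise.
    proxy = [y + 0.5 + rng.gauss(0.0, 0.3) for y in truth]
    # Labelling probabilities depend only on the always-observed proxy.
    probs = [0.1 if f < 2.5 else 0.3 for f in proxy]
    labelled = [rng.random() < p for p in probs]
    return proxy, truth, labelled, probs


if __name__ == "__main__":
    proxy, truth, labelled, probs = simulate()
    print("proxy-only mean :", sum(proxy) / len(proxy))
    print("debiased mean   :", predict_then_debias_mean(proxy, truth, labelled, probs))
    print("full-data mean  :", sum(truth) / len(truth))
```

In this toy setting the proxy-only mean inherits the proxy's bias, while the debiased estimate should land close to the full-data mean even though only a fraction of units are labelled. Setting `tuning=0.0` reduces the sketch to a plain inverse-probability-weighted mean of the labelled units.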

Paper Structure

This paper contains 79 sections, 37 theorems, 363 equations, 2 figures, and 1 table.

Key Result

Theorem 1

Under two-phase proxy-assisted multiwave sampling and Assumptions assump:IIDUnderlyingData, assump:LabellingRuleOverlap, and assump:SmoothEnoughForAsymptoticLineariaty, [...] Moreover, the above are $O_p(1)$.

Figures (2)

  • Figure 1: Comparison of two-phase multiwave sampling strategies. The baseline strategy (grey) involves one wave of uniform random sampling and is compared to adaptive sampling with either $2$, $6$, or $26$ waves in Phase II, using either the stratified approach described in Section \ref{sec:EstimateOptimalLabellingProbabilityInEachStrata} (blue) or a kNN-based approach described in Section \ref{sec:EstimateOptimalStrategyWithML} (green) to approximate a greedy optimal labelling rule. For all sampling strategies considered, the (Multiwave) Predict-Then-Debias estimator with the optimal tuning matrix was used. Each column corresponds to a different experiment, with the column number matching the experiment number. The first row shows the RMSEs calculated across the $1{,}000$ simulations, rescaled by a constant so that the uniform sampling baseline has an RMSE of $1$. The second row shows the efficiency relative to the uniform sampling baseline, averaged across the $1{,}000$ simulations. The third row gives the empirical coverage of the 90% confidence intervals across the $1{,}000$ simulations, with the dashed line giving the nominal coverage. The error bars give $\pm 2$ standard errors for the evaluation metric being plotted. Note that the uniform sampling baseline has smaller standard errors because $6{,}000$ simulations of the baseline were conducted. In the AlphaFold experiments, we do not consider kNN approaches because the feature space is discrete.
  • Figure 2: Histograms of the Multiwave Predict-Then-Debias estimator under two-phase multiwave sampling across $1{,}000$ simulations. Each histogram corresponds to a different experiment (Table \ref{table:ExperimentSummary}), and we depict the results when the number of waves is $K=6$. For all experiments except the AlphaFold one, the kNN-based approach for estimating the greedy optimal labelling rule (Section \ref{sec:EstimateOptimalStrategyWithML}) was used. The histogram y-axis is rescaled to the density scale, with the red lines giving a Gaussian distribution whose mean and variance match the empirical ones from the simulation. The panel titles give the p-value from a Shapiro-Wilk test for normality.

Theorems & Definitions (72)

  • Theorem 1
  • Corollary 2
  • Theorem 3
  • Proposition 4
  • Proposition 5
  • Remark 1
  • Lemma A.1
  • proof
  • Lemma A.2
  • proof
  • ...and 62 more