Stratified Sampling for Model-Assisted Estimation with Surrogate Outcomes

Reagan Mozer; Nicole E. Pashley; Luke Miratrix

TL;DR

This paper tackles the challenge of unbiasedly estimating treatment effects in trials where outcomes are costly to measure but surrogate predictions are available. It introduces stratified sampling within the model-assisted estimation framework, deriving exact variance expressions and a Neyman-type optimal allocation to distribute human coding effort across strata. Through extensive simulations and two empirical applications (MORE and PERSUADE), the authors show that stratification yields meaningful efficiency gains when surrogate errors exhibit structured bias or heteroskedasticity, with modest additional gains from optimal allocation. The work provides practical design-based guidelines and an open-source R package (stratsampling) to implement these methods across settings where outcomes are expensive, extending beyond text data to any costly outcome measurement scenario.
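For orientation, the classical (textbook) Neyman allocation can be written as below. This is a sketch of the standard survey-sampling rule, not the paper's exact result: the symbols $n$ (total coding budget), $N_h$ (stratum size), and $S_h$ (within-stratum standard deviation, here taken to be that of the surrogate residual) are assumptions introduced for illustration, and the paper's "Neyman-type" rule may differ in detail, for example in how treatment arms enter.

```latex
% Textbook Neyman allocation of a total sample of size n across strata h = 1, ..., H.
% Assumed notation: N_h = stratum size, S_h = within-stratum SD of the residual.
\[
  n_h \;=\; n \cdot \frac{N_h S_h}{\sum_{k=1}^{H} N_k S_k}
\]
% Worked example with n = 100, (N_1, S_1) = (600, 1), (N_2, S_2) = (400, 3):
\[
  n_1 = 100 \cdot \frac{600}{600 + 1200} \approx 33,
  \qquad
  n_2 = 100 \cdot \frac{1200}{600 + 1200} \approx 67
\]
```

Proportional allocation would code 60 and 40 units in this example, so the rule shifts effort toward the smaller but noisier stratum.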

Abstract

In many randomized trials, outcomes such as essays or open-ended responses must be manually scored as a preliminary step to impact analysis, a process that is costly and limiting. Model-assisted estimation offers a way to combine surrogate outcomes generated by machine learning or large language models with a human-coded subset, yet typical implementations use simple random sampling and therefore overlook systematic variation in surrogate prediction error. We extend this framework by incorporating stratified sampling to more efficiently allocate human coding effort. We derive the exact variance of the stratified model-assisted estimator, characterize conditions under which stratification improves precision, and identify a Neyman-type optimal allocation rule that oversamples strata with larger residual variance. We evaluate our methods through a comprehensive simulation study to assess finite-sample performance. Overall, we find stratification consistently improves efficiency when surrogate prediction errors exhibit structured bias or heteroskedasticity. We also present two empirical applications, one using data from an education RCT and one using a large observational corpus, to illustrate how these methods can be implemented in practice using ChatGPT-generated surrogate outcomes. This framework provides a practical design-based approach for leveraging surrogate outcomes and strategically allocating human coding effort to obtain unbiased estimates with greater efficiency. While motivated by text-as-data applications, the methodology applies broadly to any setting where outcome measurement is costly or prohibitive, and can be applied either to comparisons across groups or to estimating the mean of a single group.
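The idea of combining surrogates for every unit with a stratified human-coded subset can be sketched for the single-group case as follows. This is a minimal illustration of a generic difference-type estimator, not the authors' implementation or their R package: the function name, argument layout, and toy data are all hypothetical. The estimate is the mean surrogate over all units plus a stratum-weighted average of the coded residuals (human score minus surrogate).

```python
import random
from statistics import mean


def stratified_model_assisted_mean(surrogate, strata, coded):
    """Difference-type estimate of one group's mean outcome (illustrative sketch).

    surrogate: dict unit_id -> surrogate prediction, available for every unit
    strata:    dict unit_id -> stratum label, available for every unit
    coded:     dict unit_id -> human-coded outcome, only for sampled units
    """
    n_units = len(surrogate)

    # Start from the mean surrogate prediction over the full population.
    estimate = mean(list(surrogate.values()))

    # Collect residuals (human score minus surrogate) by stratum.
    residuals = {}
    for uid, y in coded.items():
        residuals.setdefault(strata[uid], []).append(y - surrogate[uid])

    # Count population units per stratum to form the weights N_h / N.
    sizes = {}
    for h in strata.values():
        sizes[h] = sizes.get(h, 0) + 1

    # Add the stratum-weighted mean residual as a bias correction.
    for h, size_h in sizes.items():
        if h not in residuals:
            raise ValueError(f"stratum {h!r} has no human-coded units")
        estimate += (size_h / n_units) * mean(residuals[h])
    return estimate


def demo(seed=0):
    """Toy population in which surrogate error differs by stratum (assumed setup)."""
    rng = random.Random(seed)
    truth, surrogate, strata = {}, {}, {}
    for uid in range(200):
        h = "short" if uid < 120 else "long"
        bias = -0.2 if h == "short" else 0.5  # structured surrogate bias
        truth[uid] = rng.gauss(3.0, 1.0)
        surrogate[uid] = truth[uid] + bias + rng.gauss(0.0, 0.1)
        strata[uid] = h

    # Simple random sample within each stratum; only these units get coded.
    coded = {}
    for h, n_h in {"short": 10, "long": 20}.items():
        members = [u for u in strata if strata[u] == h]
        for u in rng.sample(members, n_h):
            coded[u] = truth[u]

    estimate = stratified_model_assisted_mean(surrogate, strata, coded)
    return estimate, mean(list(truth.values()))


if __name__ == "__main__":
    est, true_mean = demo()
    print(f"estimate={est:.3f}  true mean={true_mean:.3f}")
```

Because the correction is computed within strata, a surrogate whose bias varies across strata is still corrected stratum by stratum; the per-stratum coding sample sizes in `demo` are arbitrary and could instead come from an allocation rule.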

Paper Structure (58 sections, 2 theorems, 82 equations, 16 figures, 6 tables)

Introduction
Background & Problem Setup
Model-assisted estimation
Using Stratified Sampling to Improve Precision
The stratified estimator
The estimator's variance
Estimating the variance
The variance reduction from stratification
Optimal allocation strategy
Simulation Study
The data generating process
Simulation factors
Estimation Methods
Evaluation Metrics
Results
...and 43 more sections

Key Result

Theorem 3.1

$\widehat{ATE}_S$ is unbiased for $ATE$, and its exact variance is derived in the paper; the expression is not reproduced here.

Figures (16)

Figure 1: Average empirical SE of estimated treatment effect under different methods across simulation scenarios. Horizontal dashed line denotes standard error under full coding.
Figure 2: Percent reduction in empirical variance of the model-assisted estimator under stratified sampling compared to SRS.
Figure 3: Relative influence of simulation design factors on empirical SE (log scale). Reference levels represent the configuration with no bias, homogeneous residual variance structure, balanced exact strata configuration, and a target $R^2$ of 0.4.
Figure 4: MDES as a function of coding budget, using stratification or the SRS method.
Figure 5: Empirical standard error of each estimator across simulation scenarios when the target $R^2 = 0.40$.
...and 11 more figures

Theorems & Definitions (3)

Theorem 3.1
Remark
Theorem 3.2
