Enhancing External Validity of Experiments with Ongoing Sampling

Chen Wang, Shichao Han, Shan Huang

Abstract

Participants in online experiments often enroll over time, which can compromise sample representativeness due to temporal shifts in covariates. This issue is particularly critical in A/B tests, online controlled experiments extensively used to evaluate product updates, since these tests are cost-sensitive and typically short in duration. We propose a novel framework that dynamically assesses sample representativeness by dividing the ongoing sampling process into three stages. We then develop stage-specific estimators for Population Average Treatment Effects (PATE), ensuring that experimental results remain generalizable across varying experiment durations. Leveraging survival analysis, we develop a heuristic function that identifies these stages without requiring prior knowledge of population or sample characteristics, thereby keeping implementation costs low. Our approach bridges the gap between experimental findings and real-world applicability, enabling product decisions to be based on evidence that accurately represents the broader target population. We validate the effectiveness of our framework on three levels: (1) through a real-world online experiment conducted on WeChat; (2) via a synthetic experiment; and (3) by applying it to 600 A/B tests on WeChat in a platform-wide application. Additionally, we provide practical guidelines for practitioners to implement our method in real-world settings.

Paper Structure

This paper contains 42 sections, 1 theorem, 34 equations, 15 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

Consider completely randomized experiments across heterogeneous groups characterized by $\bm{X}$. For all $t>T_o$, the estimated absolute difference between $\tau_t$ and $\tau$ is less than the product of $\hat{\pi}_{inf}(t)$ and the weighted average of the absolute values of the heterogeneous treatment effects, where ... Moreover, if we assume that the weighted average of the sum of $|\hat{\tau}_{HTE}(t,\bm{x})|$ ...

Figures (15)

  • Figure 1: Change of the average treatment effect over time.
  • Figure 2: Illustration of the different stages divided by criteria defined by specific time points.
  • Figure 3: Changes in the covariate distribution within the sample over the course of 9 days.
  • Figure 4: Debiased estimation of the average treatment effect at different experiment stages.
  • Figure 5: Bias-variance trade-off for two estimators during the representative stage over time.
  • ...and 10 more figures

Theorems & Definitions (3)

  • Definition 1: Time of Representativeness
  • Definition 2: Time of Overlap
  • Proposition 1: Upper Bound of Bias
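
As a rough illustration of the arithmetic behind Proposition 1 as it is worded above, the sketch below multiplies a given value of $\hat{\pi}_{inf}(t)$ by a weighted average of absolute group-level treatment effects. The function name, the argument names, and the normalization of the weights are assumptions made for this example. The excerpt does not define $\hat{\pi}_{inf}(t)$ or the weights, so both are treated as plain inputs rather than derived quantities, and this should not be read as the authors' implementation.

```python
from typing import Sequence


def bias_upper_bound(
    pi_inf: float,
    weights: Sequence[float],
    hte_estimates: Sequence[float],
) -> float:
    """Return pi_inf times the weighted average of |HTE| across groups.

    pi_inf        : value standing in for pi_hat_inf(t), supplied by the caller
                    (its definition is not given in the excerpt).
    weights       : hypothetical non-negative group weights, one per group x.
    hte_estimates : estimated heterogeneous treatment effects, one per group x.
    """
    if len(weights) != len(hte_estimates):
        raise ValueError("weights and hte_estimates must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Weighted average of the absolute effects (weights normalized here by assumption).
    weighted_abs = sum(w * abs(h) for w, h in zip(weights, hte_estimates)) / total
    return pi_inf * weighted_abs


if __name__ == "__main__":
    # Two hypothetical groups with weights 1 and 3 and effects 0.4 and -0.8.
    print(bias_upper_bound(0.2, [1, 3], [0.4, -0.8]))
```

The signs of the group-level effects do not matter because only their absolute values enter the product, so the bound cannot shrink through effects of opposite sign cancelling each other out.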