Improving precision of A/B experiments using trigger intensity
Tanmoy Das, Dohyeon Lee, Arnab Sinha
TL;DR
The paper tackles the challenge of low precision in industry A/B experiments caused by small treatment effects and the high cost of identifying all trigger observations. It introduces a sampling-based evaluation that infers trigger status from a subset of observations, and derives theoretical results showing that the bias is inversely proportional to the number of sampled observations $m$. Through simulations and real online data, it shows that sampling at most $0.1\%$ of observations is enough to drive the bias to effectively zero in simulated settings, and that the empirical standard error (SE) drops by about $36.5\%$, improving the ability to detect small effects. The work provides a practical methodology for cost-effective precision enhancement in production A/B platforms, with guidance on sample sizes and the expected bias-variance tradeoff.
Abstract
In industry, the online randomized controlled experiment (a.k.a. A/B experiment) is a standard approach to measuring the impact of a causal change. These experiments typically have small treatment effects in order to limit the potential blast radius. As a result, they often lack statistical significance because of a low signal-to-noise ratio. A standard approach to improving the precision (that is, reducing the standard error) focuses only on the trigger observations, where the outputs of the treatment and the control model differ. Although evaluation with full information about trigger observations (full knowledge) improves the precision, detecting all such trigger observations is costly. In this paper, we propose a sampling-based evaluation method (partial knowledge) to reduce this cost. The randomness of sampling introduces bias in the estimated outcome. We theoretically analyze this bias and show that it is inversely proportional to the number of observations used for sampling. We also compare the proposed evaluation methods using simulation and empirical data. In simulation, the bias in evaluation with partial knowledge effectively reduces to zero when a limited number of observations ($\leq 0.1\%$) are sampled for trigger estimation. In the empirical setup, evaluation with partial knowledge reduces the standard error by 36.48%.
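To make the precision argument concrete, the following is a minimal simulation sketch of why restricting the analysis to trigger observations lowers the standard error. It covers only the full-knowledge case and is not the paper's estimator: the function name `compare_estimators`, the Gaussian outcome model, and the parameters `n`, `trigger_rate`, and `effect` are all illustrative assumptions, and the sampling-based (partial knowledge) method is not reproduced here.

```python
import numpy as np


def compare_estimators(n=200_000, trigger_rate=0.1, effect=0.5, seed=0):
    """Compare an all-observations estimate with a trigger-only estimate.

    Hypothetical setup (not from the paper): unit-variance Gaussian noise,
    and a treatment that shifts the outcome by `effect` only on triggered
    observations, so the overall effect is about trigger_rate * effect.
    """
    rng = np.random.default_rng(seed)
    treated = rng.random(n) < 0.5             # random 50/50 assignment
    triggered = rng.random(n) < trigger_rate  # trigger status, fully known here
    y = rng.normal(0.0, 1.0, size=n) + effect * (treated & triggered)

    def diff_and_se(mask):
        # Difference in means and its standard error within `mask`.
        t, c = y[treated & mask], y[~treated & mask]
        diff = t.mean() - c.mean()
        se = np.sqrt(t.var(ddof=1) / t.size + c.var(ddof=1) / c.size)
        return diff, se

    all_diff, all_se = diff_and_se(np.ones(n, dtype=bool))
    trig_diff, trig_se = diff_and_se(triggered)

    # Dilute the triggered effect back to the whole population.
    # The variability of the observed trigger rate is ignored in this sketch.
    p = triggered.mean()
    return {
        "all_diff": all_diff,
        "all_se": all_se,
        "diluted_diff": p * trig_diff,
        "diluted_se": p * trig_se,
    }


if __name__ == "__main__":
    for name, value in compare_estimators().items():
        print(f"{name}: {value:.5f}")
```

Both estimates target the same overall effect, but non-triggered observations contribute only noise to the all-observations estimate. Under these assumptions the diluted standard error shrinks roughly by a factor of the square root of the trigger rate.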
