Improving precision of A/B experiments using trigger intensity

Tanmoy Das, Dohyeon Lee, Arnab Sinha

TL;DR

The paper tackles the challenge of low precision in industry A/B experiments caused by small treatment effects and the high cost of identifying all trigger observations. It introduces a sampling-based evaluation that infers trigger status from a subset of observations, deriving theoretical results showing bias decays inversely with the sample size $m$ and demonstrating substantial variance reduction. Through simulations and real online data, it shows that sampling as little as $0.1\%$ of observations can eliminate bias in simulated settings and that empirical SE can drop by about $36.5\%$, improving the ability to detect small effects. The work provides a practical methodology for cost-effective precision enhancement in production A/B platforms, with clear guidance on sample sizes and expected bias-variance tradeoffs.

Abstract

In industry, the online randomized controlled experiment (a.k.a. A/B experiment) is a standard approach to measuring the impact of a causal change. These experiments have small treatment effects to reduce the potential blast radius. As a result, they often lack statistical significance due to a low signal-to-noise ratio. A standard approach for improving precision (that is, reducing the standard error) focuses only on the trigger observations, where the outputs of the treatment and control models differ. Although evaluation with full information about trigger observations (full knowledge) improves precision, detecting all such trigger observations is costly. In this paper, we propose a sampling-based evaluation method (partial knowledge) to reduce this cost. The randomness of sampling introduces bias in the estimated outcome. We theoretically analyze this bias and show that it is inversely proportional to the number of observations used for sampling. We also compare the proposed evaluation methods using simulation and empirical data. In simulation, the bias in evaluation with partial knowledge effectively reduces to zero when a limited number of observations ($\le 0.1\%$) are sampled for trigger estimation. In the empirical setup, evaluation with partial knowledge reduces the standard error by 36.48%.
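To make the precision argument concrete, the following is a minimal simulation sketch of generic trigger-based evaluation. It is not the paper's estimator, which is not reproduced in this summary. It assumes that only trigger observations carry a treatment effect, that the trigger status of every observation is known (the full-knowledge case), and that the trigger rate can be treated as a fixed constant when scaling the standard error. All names and parameter values (`simulate`, `diff_in_means`, `trigger_rate`, `effect`) are hypothetical choices for illustration.

```python
import numpy as np


def simulate(n=200_000, trigger_rate=0.1, effect=0.05, noise_sd=1.0, seed=0):
    """Simulate an A/B experiment in which only trigger observations are affected.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    treated = rng.integers(0, 2, size=n).astype(bool)   # random 50/50 assignment
    triggered = rng.random(n) < trigger_rate            # treatment/control outputs differ
    # Non-trigger observations see no effect, so they only contribute noise.
    y = rng.normal(0.0, noise_sd, size=n) + effect * (treated & triggered)
    return treated, triggered, y


def diff_in_means(y, treated):
    """Difference-in-means estimate and its standard error (unequal variances)."""
    a, b = y[treated], y[~treated]
    est = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return est, se


treated, triggered, y = simulate()

# No knowledge of triggers: compare all treated vs. all control observations.
est_all, se_all = diff_in_means(y, treated)

# Full knowledge: estimate on trigger observations only, then scale by the
# observed trigger rate to recover the overall (diluted) effect.
est_trig, se_trig = diff_in_means(y[triggered], treated[triggered])
rate = triggered.mean()
est_diluted, se_diluted = est_trig * rate, se_trig * rate  # rate treated as fixed

print(f"all observations : {est_all:.5f} (SE {se_all:.5f})")
print(f"trigger-only     : {est_diluted:.5f} (SE {se_diluted:.5f})")
```

Under these assumptions the two estimates target the same overall effect (`effect * trigger_rate`), but the trigger-only standard error is smaller by a factor of roughly the square root of the trigger rate, because the non-trigger observations add noise without adding signal.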

Paper Structure

This paper contains 29 sections, 8 theorems, 38 equations, 4 figures, 4 tables.

Key Result

Theorem 5.1

With no knowledge of trigger intensity a) The estimated average treatment effect is b) Suppose, the variance of the residual is $\sigma^2(\hat{\eta_{\alpha}})$. The variance of the estimated treatment effect is

Figures (4)

  • Figure 1: Illustrative example of trigger vs. non-trigger observations. Four observations are shown: two for control products (left) and two for treatment products (right). The top observations are triggers and the bottom observations are non-triggers. In the top left, there are two images (A and B) to rank. Because the product is in control, the ranking on the product webpage, which is visible to the customer, is produced by the control model. A has the highest rank, so it is placed on top. In the backend, the same images are ranked by the treatment model. Because the control and treatment outputs differ, this observation is denoted a trigger. In the bottom left, the control and treatment model outputs are the same, so this is a non-trigger observation. The same analysis applies to the treatment product (top right and bottom right): the treatment model ranks the images on the webpage, and the control model is used in the backend to determine trigger status.
  • Figure 2: Evaluations with no knowledge and full knowledge of trigger intensity are unbiased irrespective of the noise characteristics.
  • Figure 3: Evaluation with full knowledge of trigger intensity is more precise than evaluation with no knowledge of trigger intensity, irrespective of the noise characteristics.
  • Figure 4: Comparison of estimated values with full knowledge of trigger intensity and partial knowledge of trigger intensity.

Theorems & Definitions (8)

  • Theorem 5.1
  • Theorem 5.2
  • Theorem 5.3
  • Corollary 5.3.1
  • Theorem 5.4
  • Corollary 5.4.1
  • Lemma 6.1
  • Theorem 6.2