Setting the duration of online A/B experiments

Harrison H. Li; Chaoyu Yu

TL;DR

The paper addresses how to set the duration of an online A/B test to achieve a desired CI width while accounting for persistent user-level correlations. It develops a two-component mixed-effects framework with a user-specific temporal correlation $\rho$ to describe how CI width decays with duration for a ratio treatment effect estimator, showing that higher $\rho$ slows CI shrinkage and that the CI width saturates at $\sqrt{\rho}$ times the day-1 CI width as $T$ grows. It further extends the framework with a pre-period adjustment via a pre-post estimator, introducing a pre-post correlation $\lambda(T)$ and giving explicit relative-efficiency results, including a closed-form expression for $\lambda(T)$ in terms of $T_0$ and $\rho$. The framework is validated on YouTube data across multiple metrics, demonstrating that the predicted CI-decay curves closely track observed CI widths and enabling practical, one-parameter planning tools for experiment duration.

Abstract

In designing an online A/B experiment, it is crucial to select a sample size and duration that ensure the resulting confidence interval (CI) for the treatment effect is the right width to detect an effect of meaningful magnitude with sufficient statistical power without wasting resources. While the relationship between sample size and CI width is well understood, the effect of experiment duration on CI width remains less clear. This paper provides an analytical formula for the width of a CI based on a ratio treatment effect estimator as a function of both sample size ($N$) and duration ($T$). The formula is derived from a mixed effects model with two variance components. One component, referred to as the temporal variance, persists over time for experiments where the same users are kept in the same experiment arm across different days. The remaining error variance component, by contrast, decays to zero as $T$ gets large. The formula we derive introduces a key parameter that we call the user-specific temporal correlation (UTC), which quantifies the relative sizes of the two variance components and can be estimated from historical experiments. Higher UTC indicates a slower decay in CI width over time. On the other hand, when the UTC is 0 -- as for experiments where users shuffle in and out of the experiment across days -- the CI width decays at the standard parametric $1/\sqrt{T}$ rate. We also study how access to pre-period data for the users in the experiment affects the CI width decay. We show our formula closely explains CI widths on real A/B experiments at YouTube.
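The dependence of CI width on duration described above can be illustrated with a small planning sketch. The functional form used here, a normalized width of $\sqrt{\rho + (1-\rho)/T}$, is an assumption and is not quoted from the paper: it is simply the simplest expression that equals 1 at $T=1$, levels off at $\sqrt{\rho}$ as $T$ grows, and reduces to the standard parametric rate when $\rho = 0$. The function names `normalized_ci_width` and `min_duration` are hypothetical.

```python
import math


def normalized_ci_width(T, rho):
    """CI width after T days divided by the day-1 CI width.

    ASSUMPTION: uses the form sqrt(rho + (1 - rho) / T), chosen only to
    match the limiting behaviour described in the text; it is not taken
    verbatim from the paper.
    """
    if T < 1:
        raise ValueError("T must be at least 1 day")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return math.sqrt(rho + (1.0 - rho) / T)


def min_duration(target_ratio, rho, max_days=365):
    """Smallest whole number of days whose normalized CI width is at most
    target_ratio, or None if no duration up to max_days achieves it
    (for example when target_ratio is below sqrt(rho))."""
    for T in range(1, max_days + 1):
        if normalized_ci_width(T, rho) <= target_ratio:
            return T
    return None


if __name__ == "__main__":
    # Illustrative values only; rho would be estimated from historical experiments.
    for rho in (0.0, 0.3, 0.6):
        widths = [round(normalized_ci_width(T, rho), 3) for T in (1, 7, 14, 28)]
        print(f"rho={rho}: widths at T=1,7,14,28 -> {widths}")
```

Under this assumed form, a target below $\sqrt{\rho}$ of the day-1 width cannot be reached by extending the experiment alone, which is why `min_duration` can return `None`; in that case a larger sample size would be the lever to pull.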

Paper Structure (11 sections, 2 theorems, 28 equations, 3 figures, 1 table)

Introduction
CI width as a function of experiment duration
A two-component mixed effects model for A/B experiment metrics
User level modeling
Ratio treatment effect
Estimator and CI construction
User-specific temporal correlation
Adjusting for pre-period data
User-day diversion
Estimation of UTC and numerical results
Summary and discussion

Key Result

Proposition 1

Suppose the paper's assumptions (from i.i.d. users through constant ratio) hold for an additive metric $a_{utj}$. Then, in the notation of the paper's notation table, we have $\sqrt{N}(\hat{\theta}-\theta) \stackrel{d}{\rightarrow} \mathcal{N}(0,V_{\hat{\theta}}(T))$ as $N \rightarrow \infty$, where $V_{\hat{\theta}}(T)$ is the asymptotic variance given in the paper.

Figures (3)

Figure 1: The relationship between the width of the CI for the treatment effect (normalized by the day 1 CI width) and the experiment duration $T$. The black solid curve corresponds to the $1/\sqrt{T}$ decay rate.
Figure 2: The asymptotic standard error $\sqrt{V_{PP}(T)}$ of the pre-post estimator $\hat{\theta}_{PP}$ as a function of experiment duration $T$ is given in blue. The asymptotic standard error $\sqrt{V_{\hat{\theta}}(T)}$ of the post-period estimator $\hat{\theta}$ in a user-day experiment of size $N$ as a function of $T$ is given in brown. The pre-post estimator assumes $T_0=7$ days of pre-period data are available, and that the UTC $\rho$ is equal to 0.6 in the user experiment.
Figure 3: For each metric, the dotted line represents how we predict CI width (normalized by the day 1 CI width) would change with experiment duration $T$ based on our model, and the solid line represents the observed CI width change in actual experiments. The black solid curve describes the $1/\sqrt{T}$ decay rate.

Theorems & Definitions (7)

Remark 1
Remark 2
Proposition 1
proof
Proposition 2
proof
Remark 3
