Steady Continuous Monitoring is (Just Barely) Impossible for Tests of Unbounded Length

Eric Bax, Alex Shtoff

TL;DR

The paper addresses the challenge of steady continuous monitoring in AB tests with unbounded duration, where a fixed stopping rule cannot control Type I error. It formalizes unbounded tests using decision points, thresholds, and repetition requirements, and derives a general error bound: the Type I error is at most $\sum_{t=1}^{\infty} \frac{\delta_t}{r_t}$. It then advocates geometric $\alpha$-spending and, more generally, convergent $p$-series spending ($x_t \propto 1/t^v$, $v>1$) to delay the growth of required significance while maintaining error control, showing that an exactly flat curve is impossible but can be approached. The work demonstrates how repetition-based stopping can offer practical control of early stopping stringency and long-run power, and discusses combining these ideas with other always-valid bounds for robust continuous monitoring in long-running experiments.
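
As a rough numerical illustration of the bound $\sum_t \delta_t / r_t$, the Python sketch below evaluates partial sums for three schedules. Everything named in it is an illustrative assumption rather than the paper's construction: the function names, the repetition rule $r_t = \lceil u t \rceil$, the choice $\delta_t = r_t x_t$ (so that each term of the bound equals $x_t$), and the conservative normalizing constant for the $p$-series.

```python
import math

def repeats(t, u):
    """Hypothetical repetition rule: r_t = ceil(u * t), at least 1."""
    return max(1, math.ceil(u * t))

def geometric_spend(t, alpha, g):
    """x_t = alpha * (1 - g) * g**(t - 1); sums to alpha over t = 1, 2, ..."""
    return alpha * (1.0 - g) * g ** (t - 1)

def p_series_spend(t, alpha, v):
    """x_t = c / t**v with c = alpha * (v - 1) / v.

    For v > 1, sum_{t>=1} 1/t**v <= 1 + 1/(v - 1) = v / (v - 1),
    so this conservative constant keeps sum_t x_t <= alpha.
    """
    return alpha * (v - 1.0) / v / t ** v

def error_bound(deltas, rs):
    """Partial sum of the Type I error bound sum_t delta_t / r_t."""
    return sum(d / r for d, r in zip(deltas, rs))

alpha, u, horizon = 0.05, 0.1, 100_000
ts = range(1, horizon + 1)
rs = [repeats(t, u) for t in ts]

# Spending schedules: set delta_t = r_t * x_t so that delta_t / r_t = x_t.
geo = error_bound([r * geometric_spend(t, alpha, 0.5) for t, r in zip(ts, rs)], rs)
pser = error_bound([r * p_series_spend(t, alpha, 1.1) for t, r in zip(ts, rs)], rs)

# Flat threshold: delta_t = delta for every t. With r_t growing like u * t,
# the terms shrink like a harmonic series, so partial sums keep growing.
flat_short = error_bound([0.001] * 1_000, rs[:1_000])
flat_long = error_bound([0.001] * horizon, rs)

print(f"geometric: {geo:.6f}  p-series: {pser:.6f}")
print(f"flat, 1e3 points: {flat_short:.4f}  flat, 1e5 points: {flat_long:.4f}")
```

Under these assumptions, the geometric and $p$-series totals stay at or below $\alpha$ however long the test runs. The flat schedule's partial sums keep increasing with the horizon, so the bound alone gives no fixed guarantee for an unbounded test.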

Abstract

AB testing evaluates the difference between a control and a treatment in a statistically rigorous manner. Continuous monitoring allows statistical evaluation of an AB test as it proceeds. One goal of continuous monitoring is early stopping -- confirming a statistically significant difference between control and treatment as soon as possible. Another goal is to maintain some statistical capability to discover significant differences later in the test if they cannot be confirmed earlier. These goals are in conflict -- looser requirements for early stopping leave us with more stringent ones for later. This paper shows that it is impossible to maintain a constant requirement for significance for tests that have no a priori stopping time, but we can come arbitrarily close to that goal by using tests that require repeated significant results to confirm statistically significant differences between treatment and control.

Paper Structure (6 sections, 5 theorems, 41 equations, 3 figures)

Key Result

Lemma 1

Consider a test, specified by an integer $d > 0$ and nonnegative values $\delta_1, \ldots, \delta_d$ and $r_1 \leq \ldots \leq r_d$, that stops and rejects the null hypothesis at the first decision point $t$ for which at least $r_t$ of the decision points $k$ from $1$ to $t$ each have $p_k \leq \del Then any solution to the following linear program is a joint probability distribution of events $F_

Figures (3)

  • Figure 1: Requiring Repetition. For $r = 3$ required repeats over $d = 6$ interim analyses, each with interim analysis Type I error probability $\delta$, the worst-case probability of Type I error for the entire test is $\frac{6 \delta}{3} = 2 \delta$, since at least 3 interim analyses must have Type I errors for the entire test to have one. (If six carpets each cover $\delta$ area, then at most $2 \delta$ area can be covered at least three deep.) (Figure from bax16.)
  • Figure 2: Comparison to another always-valid method. For $u = 0.2$, requiring repetition gives less stringent requirements for significance than a method based on autocorrelation for averages waudbysmith24.
  • Figure 3: Headless $p$-series $\alpha$ spending: $\mathbf{Z}$ scores for significance. For $p$-series $\alpha$ spending with $v = 1.1$, $\alpha = 0.05$, and $u = 0.1$, removing more initial values from the $p$-series before starting to allocate the series values to $x_t$ lowers required $Z$ scores on the right.
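
The carpet argument in the Figure 1 caption can be checked on a small finite example. The Python sketch below uses a hypothetical setup that is not taken from the paper: a sample space of 100 equally likely outcomes, with six events that each cover 5 of them ($\delta = 0.05$). It builds a worst-case arrangement in which the events are stacked into two disjoint piles of three, then draws random arrangements to confirm that none exceeds $d\delta/r = 2\delta$.

```python
import random
from fractions import Fraction

def prob_at_least(events, r, n_outcomes):
    """P(at least r of the events occur) under a uniform distribution
    on outcomes 0..n_outcomes-1; each event is a set of outcomes."""
    hits = sum(1 for w in range(n_outcomes)
               if sum(w in e for e in events) >= r)
    return Fraction(hits, n_outcomes)

n, d, r = 100, 6, 3
size = 5                         # each event covers 5 of 100 outcomes
delta = Fraction(size, n)        # per-analysis error probability
bound = Fraction(d, r) * delta   # d * delta / r = 2 * delta

# Worst case: two disjoint piles, each made of three identical events.
pile_a, pile_b = set(range(0, 5)), set(range(5, 10))
worst = [pile_a] * 3 + [pile_b] * 3
print(prob_at_least(worst, r, n), bound)

# Randomly placed events never exceed the bound.
rng = random.Random(0)
for _ in range(1000):
    events = [set(rng.sample(range(n), size)) for _ in range(d)]
    assert prob_at_least(events, r, n) <= bound
```

Exact fractions are used so that the equality between the stacked arrangement and the bound is not blurred by floating-point rounding.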

Theorems & Definitions (10)

  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Corollary 1
  • proof
  • Theorem 3
  • proof