Steady Continuous Monitoring is (Just Barely) Impossible for Tests of Unbounded Length
Eric Bax, Alex Shtoff
TL;DR
The paper addresses the challenge of steady continuous monitoring in AB tests with unbounded duration, where a fixed stopping rule cannot control Type I error. It formalizes unbounded tests using decision points, thresholds, and repetition requirements, and derives a general error bound: the Type I error is at most $\sum_{t=1}^{\infty} \frac{\delta_t}{r_t}$. It then advocates geometric $\alpha$-spending and, more generally, convergent $p$-series spending ($x_t \propto 1/t^v$, $v>1$) to delay the growth of required significance while maintaining error control, showing that an exactly flat curve is impossible but can be approached. The work demonstrates how repetition-based stopping can offer practical control of early stopping stringency and long-run power, and discusses combining these ideas with other always-valid bounds for robust continuous monitoring in long-running experiments.
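As a rough numerical illustration of the bound $\sum_{t=1}^{\infty} \frac{\delta_t}{r_t}$, the sketch below builds a geometric and a $p$-series spending schedule and checks that each stays within a budget $\alpha$. It rests on assumptions that are not taken from the paper: $x_t$ is read as the per-decision-point term $\delta_t / r_t$, the repetition requirement in the demo is set to $r_t = t$, and the function names (`zeta_upper`, `p_series_spend`, `geometric_spend`, `error_bound`) are hypothetical. It is a minimal sketch under those assumptions, not the authors' procedure.

```python
def zeta_upper(v, n_terms=100_000):
    """Upper bound on zeta(v) for v > 1: partial sum plus an integral tail bound."""
    if v <= 1:
        raise ValueError("the p-series converges only for v > 1")
    partial = sum(t ** -v for t in range(1, n_terms + 1))
    # The tail sum over t > n_terms is at most the integral of t**-v from n_terms to infinity.
    return partial + n_terms ** (1 - v) / (v - 1)


def p_series_spend(alpha, v, horizon):
    """x_t proportional to 1/t**v, scaled so the infinite sum is at most alpha."""
    c = alpha / zeta_upper(v)
    return [c / t ** v for t in range(1, horizon + 1)]


def geometric_spend(alpha, q, horizon):
    """x_t = alpha * (1 - q) * q**(t - 1); the infinite sum is exactly alpha for 0 < q < 1."""
    return [alpha * (1 - q) * q ** (t - 1) for t in range(1, horizon + 1)]


def error_bound(deltas, reps):
    """Partial sum of delta_t / r_t over the supplied decision points."""
    return sum(d / r for d, r in zip(deltas, reps))


if __name__ == "__main__":
    alpha, horizon = 0.05, 1000
    reps = range(1, horizon + 1)  # assumption: r_t = t

    # Assumption: x_t = delta_t / r_t, so the implied threshold is delta_t = r_t * x_t.
    x = p_series_spend(alpha, v=1.1, horizon=horizon)
    deltas = [r * xt for r, xt in zip(reps, x)]
    print("p-series bound:", error_bound(deltas, reps))
    print("delta_1, delta_1000:", deltas[0], deltas[-1])

    print("geometric bound:", sum(geometric_spend(alpha, q=0.5, horizon=60)))

    # A flat threshold with r_t = t gives a harmonic series, which grows without bound.
    print("flat-threshold partial sum:", error_bound([alpha] * horizon, reps))
```

Under these assumptions, a $p$-series with $v$ close to 1 implies a threshold $\delta_t \propto 1/t^{v-1}$ that shrinks very slowly, while the exactly flat case reduces to the harmonic series, whose partial sums keep growing as the horizon increases.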
Abstract
AB testing evaluates the difference between a control and a treatment in a statistically rigorous manner. Continuous monitoring allows statistical evaluation of an AB test as it proceeds. One goal of continuous monitoring is early stopping -- confirming a statistically significant difference between control and treatment as soon as possible. Another goal is to maintain some statistical capability to discover significant differences later in the test if they cannot be confirmed earlier. These goals are in conflict -- looser requirements for early stopping leave us with more stringent ones for later. This paper shows that it is impossible to maintain a constant requirement for significance for tests that have no a priori stopping time, but we can come arbitrarily close to that goal by using tests that require repeated significant results to confirm statistically significant differences between treatment and control.
