Early Stopping Based on Repeated Significance
Eric Bax, Arundhyoti Sarkar, Alex Shtoff
TL;DR
This paper addresses A/B (bucket) testing with multiple success criteria and the option of early stopping while controlling the type I error rate. It develops a framework that budgets the error probability $\alpha$ across decision points and criteria using Bonferroni (Boole) bounds, and that eases the resulting thresholds by requiring repeated significance (nearly uniform validation). This yields stopping rules such as $p \le \frac{\alpha}{d m}$ and $p \le \frac{\alpha r}{d m}$. Key contributions include general theorems for repeated significance, flexible $\alpha$-spending schedules for both finite and unlimited numbers of decision points, and practical guidance for planning tests with multiple metrics, including continuous monitoring. The approach gives distribution-free validity and supports rapid, multi-criteria experimentation in online and clinical settings. It balances early stopping opportunities against power through $Z$-scores and $p$-value budgeting; for example, setting $u = \frac{r}{d}$ leads to the requirement $p \le \alpha u$ under continuous monitoring.
Abstract
For a bucket test with a single criterion for success and a fixed number of samples or testing period, requiring a $p$-value less than a specified value of $\alpha$ for the success criterion produces statistical confidence at level $1 - \alpha$. For multiple criteria, a Bonferroni correction that partitions $\alpha$ among the criteria produces statistical confidence, at the cost of requiring lower $p$-values for each criterion. The same concept can be applied to decisions about early stopping, but that can lead to strict requirements for $p$-values. We show how to address that challenge by requiring criteria to be successful at multiple decision points.
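To make the budgeting concrete, the following is a minimal sketch of the per-comparison thresholds quoted in the TL;DR, $p \le \frac{\alpha}{d m}$ and $p \le \frac{\alpha r}{d m}$. The symbols are not defined in this summary, so the sketch assumes that $m$ is the number of success criteria, $d$ is the number of decision points, and $r$ is the number of decision points at which a criterion must be significant. The function name `per_test_threshold` and its parameters are hypothetical, chosen only for illustration.

```python
import math


def per_test_threshold(alpha, criteria, decision_points=1, repeats=1):
    """Return the p-value threshold alpha * r / (d * m).

    Assumed reading of the symbols (not defined in the summary):
      criteria        -> m, number of success criteria
      decision_points -> d, number of points at which stopping is considered
      repeats         -> r, number of decision points at which a criterion
                         must be significant
    With repeats=1 this is the plain Bonferroni split alpha / (d * m).
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if criteria < 1 or decision_points < 1:
        raise ValueError("criteria and decision_points must be at least 1")
    if not 1 <= repeats <= decision_points:
        raise ValueError("repeats must lie between 1 and decision_points")
    return alpha * repeats / (decision_points * criteria)


# Fixed-length test, 3 criteria: alpha is split three ways.
fixed = per_test_threshold(0.05, criteria=3)

# Early stopping at 10 decision points: the Bonferroni threshold shrinks tenfold.
strict = per_test_threshold(0.05, criteria=3, decision_points=10)

# Requiring significance at 5 of the 10 decision points eases the threshold.
repeated = per_test_threshold(0.05, criteria=3, decision_points=10, repeats=5)

print(fixed, strict, repeated)
```

Under these assumptions, requiring five repetitions raises the per-comparison threshold fivefold relative to the plain Bonferroni split over the same ten decision points, which illustrates the trade-off the abstract describes between strict $p$-value requirements and repeated success.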
