Early Stopping Based on Repeated Significance

Eric Bax; Arundhyoti Sarkar; Alex Shtoff

TL;DR

This paper addresses A/B (bucket) testing with multiple success criteria and the option of early stopping while controlling the type I error rate. It develops a framework that budgets the error probability $\alpha$ across $d$ decision points and $m$ criteria using Bonferroni/Boole bounds and repetition (nearly uniform validation). This enables stopping rules based on repeated significance, such as $p \le \frac{\alpha}{d m}$ and, when significance must be repeated at $r$ decision points, $p \le \frac{\alpha r}{d m}$. Key contributions include general theorems for repeated significance, flexible $\alpha$-spending schedules for both finite and unlimited decision points, and practical guidance for test planning under multiple metrics, including continuous monitoring scenarios. The approach yields distribution-free validity and supports rapid, multi-criteria experimentation in online and clinical settings by balancing early stopping opportunities with power considerations via $Z$-scores and $p$-value budgeting, e.g., $u = \frac{r}{d}$ leading to $p \le \alpha u$ when monitoring continuously.
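As a rough illustration of the budgeting rules quoted above, the following sketch computes the per-test $p$-value threshold $\frac{\alpha r}{d m}$. The function name and argument checks are hypothetical, and the sketch assumes that $r$ is the number of decision points at which a criterion must be significant, with $r = 1$ recovering $\frac{\alpha}{d m}$.

```python
def repeated_significance_threshold(alpha: float, d: int, m: int, r: int = 1) -> float:
    """Per-test p-value threshold alpha * r / (d * m).

    alpha: overall type I error budget
    d:     number of decision points
    m:     number of success criteria
    r:     assumed number of decision points at which significance is required
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    if d < 1 or m < 1:
        raise ValueError("d and m must be positive")
    if not 1 <= r <= d:
        raise ValueError("r must satisfy 1 <= r <= d")
    return alpha * r / (d * m)


if __name__ == "__main__":
    # Ten decision points, two criteria, overall alpha of 0.05.
    print(repeated_significance_threshold(0.05, d=10, m=2))        # about 0.0025
    print(repeated_significance_threshold(0.05, d=10, m=2, r=5))   # about 0.0125
```

Requiring five repetitions instead of one relaxes the per-test threshold by a factor of five in this example, which is the trade-off the summary describes.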

Abstract

For a bucket test with a single criterion for success and a fixed number of samples or testing period, requiring a $p$-value less than a specified value of $\alpha$ for the success criterion produces statistical confidence at level $1 - \alpha$. For multiple criteria, a Bonferroni correction that partitions $\alpha$ among the criteria produces statistical confidence, at the cost of requiring lower $p$-values for each criterion. The same concept can be applied to decisions about early stopping, but that can lead to strict requirements for $p$-values. We show how to address that challenge by requiring criteria to be successful at multiple decision points.

Paper Structure (7 sections, 6 theorems, 29 equations, 7 figures)

This paper contains 7 sections, 6 theorems, 29 equations, 7 figures.

Introduction
Multiple Criteria, Bonferroni, and Boole
Early Stopping
Requiring Repetition
Continuous Monitoring and General Result
Analysis and Type II Error Strategy
Discussion

Key Result

Theorem 1

Let $d$ be the number of decision points and $m$ be the number of bucket criteria. Then requiring $p$-values of at most $\frac{\alpha}{d m}$ at least once for each criterion $i$ gives confidence $1 - \alpha$ that all criteria hold.
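One possible reading of this rule as a stopping procedure is sketched below, using the uniform threshold $\frac{\alpha}{d m}$ from the TL;DR. The function name, the layout of `p_values` (one row per decision point, one column per criterion), and the choice to stop as soon as every criterion has been significant at least once are illustrative assumptions, not the paper's stated procedure.

```python
from typing import Optional, Sequence


def first_stopping_point(p_values: Sequence[Sequence[float]], alpha: float) -> Optional[int]:
    """Return the index of the earliest decision point at which every criterion
    has had a p-value <= alpha / (d * m) at least once, or None if that never happens.

    p_values[t][i] is the p-value for criterion i at decision point t (hypothetical layout).
    """
    d = len(p_values)
    if d == 0 or len(p_values[0]) == 0:
        raise ValueError("need at least one decision point and one criterion")
    m = len(p_values[0])
    threshold = alpha / (d * m)   # uniform Bonferroni split over all d * m tests
    seen = [False] * m            # whether each criterion has been significant so far
    for t, row in enumerate(p_values):
        if len(row) != m:
            raise ValueError("every decision point needs one p-value per criterion")
        for i, p in enumerate(row):
            if p <= threshold:
                seen[i] = True
        if all(seen):
            return t
    return None


if __name__ == "__main__":
    # Three decision points, two criteria: threshold is 0.05 / 6.
    history = [[0.2, 0.001], [0.004, 0.3], [0.001, 0.001]]
    print(first_stopping_point(history, alpha=0.05))  # 1
```

Because the bound is a union bound over all $d m$ tests, the sketch lets the two criteria reach significance at different decision points; a stricter variant could require them to be significant simultaneously.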

Figures (7)

Figure 1: Worst Case. If six statements each have probability at most $\delta$ of being incorrect, then the probability that the combined statement: "statement 1 and statement 2 and $\ldots$ and statement 6" is incorrect may be as high as $6 \delta$, because only one statement has to be incorrect for the "and" of all 6 to be incorrect. The worst case is that failures are disjoint -- like how spreading six carpets so that they do not overlap covers as much area as possible. (Figure from bax16.)
Figure 2: Allowing Failures/Requiring Repetition. If six statements each have probability at most $\delta$ of being incorrect, then the worst-case probability that three or more are incorrect is $2 \delta$. The worst case is that any failure is simultaneous with two others -- with six carpets, laying them three-thick only covers an area equal to two carpets. (Figure from bax16.)
Figure 3: Different Numbers of Repetitions. Suppose we have two criteria for bucket success, and we require the first to hold three times and the second to hold twice to declare the test a success. Suppose there are $d = 6$ decision points and each statement that a criterion holds has probability $\delta$ of being incorrect. Then an incorrect conclusion from the bucket requires either incorrect statements that criterion one holds at three different decision points (white "carpets" piled three-deep) or incorrect statements that criterion two holds at two different decision points (gray "carpets" piled two-deep). So the probability of bucket success without the criteria actually both holding is at most $5 \delta$.
Figure 4: $\mathbf{Z}$ score required, by $\mathbf{dm}$. Each $Z$ score is the inverse of the standard normal CDF evaluated at $1 - \frac{p}{2}$, with $p = \frac{\alpha}{dm}$. The $Z$ score required for significance under a uniform partition of $\alpha$ increases as the number of decision points $d$ and criteria $m$ increase, slicing $\alpha$ more finely. The plots begin at $dm = 1$, which corresponds to a single criterion with no possibility of early stopping (one decision point at the end of the test). The required $Z$ scores have decreasing marginal increases as $dm$ increases.
Figure 5: $\mathbf{Z}$ score and sample size impact for partitioning $\mathbf{\alpha}$. The top curve is the $Z$ score required for statistical significance given the $p$-value required for statistical significance on the $x$-axis. The bottom curve is the approximate reduction in sample size (compared to $p = 0.05$) for the purpose of achieving statistical significance to avoid type II error. (Each point in the lower curve is the square of the ratio of the $Z$ score for $p = 0.05$ to the $Z$-score for the $p$-value on the $x$-axis.)
...and 2 more figures
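The quantities described in the captions for Figures 1 through 5 can be reproduced with a short sketch. The function names are hypothetical; the worst-case bound assumes each statement fails with probability at most $\delta$, and the $Z$ score is the two-sided value from the Figure 4 caption.

```python
from statistics import NormalDist
from typing import Sequence


def worst_case_error(d: int, delta: float, repetitions: Sequence[int]) -> float:
    """Worst-case probability of a wrong conclusion with d decision points.

    repetitions[i] is how many decision points criterion i must hold at.
    A criterion requiring r repetitions contributes at most d * delta / r
    (the "carpets piled r-deep" argument in Figures 1 to 3).
    """
    if any(r < 1 or r > d for r in repetitions):
        raise ValueError("each repetition count must be between 1 and d")
    return sum(d * delta / r for r in repetitions)


def required_z(p: float) -> float:
    """Two-sided Z score for p-value p: inverse standard normal CDF at 1 - p / 2."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)


def sample_size_ratio(p: float, baseline_p: float = 0.05) -> float:
    """Square of the ratio of the Z score at baseline_p to the Z score at p,
    the quantity plotted in the lower curve of Figure 5."""
    return (required_z(baseline_p) / required_z(p)) ** 2


if __name__ == "__main__":
    delta = 0.01
    print(worst_case_error(6, delta, [1]))     # Figure 1 setting: 6 * delta
    print(worst_case_error(6, delta, [3]))     # Figure 2 setting: 2 * delta
    print(worst_case_error(6, delta, [3, 2]))  # Figure 3 setting: 5 * delta
    for dm in (1, 2, 10, 100):
        p = 0.05 / dm
        print(dm, round(required_z(p), 3), round(sample_size_ratio(p), 3))
```

The loop shows the pattern noted in the Figure 4 caption: the required $Z$ score grows as $dm$ increases, but by smaller amounts each time.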

Theorems & Definitions (12)

Theorem 1
proof
Theorem 2
proof
Theorem 3
proof
Theorem 4
proof
Theorem 5
proof
...and 2 more
