Predicting fixed-sample test decisions enables anytime-valid inference

Chris Holmes, Stephen Walker

Abstract

Statistical hypothesis tests typically use prespecified sample sizes, yet data often arrive sequentially. Interim analyses invalidate classical error guarantees, while existing sequential methods require rigid testing preschedules or incur substantial losses in statistical power. We introduce a simple procedure that transforms any fixed-sample hypothesis test into an anytime-valid test while ensuring Type-I error control and near-optimal power with substantial sample savings when the null hypothesis is false. At each step, the procedure predicts the probability that a classical test would reject the null hypothesis at its fixed-sample size, treating future observations as missing data under the null hypothesis. Thresholding this probability yields an anytime-valid stopping rule. In areas such as clinical trials, stopping early and safely can ensure that subjects receive the best treatments and accelerate the development of effective therapies.

Paper Structure (40 sections, 1 theorem, 104 equations, 26 figures, 1 table)

Key Result

Proposition 1

Assume $X_{n+1:N}$ arise from the null model $\mathbb{P}_0$ conditional on $X_{1:n}$. Then $(Q_n)_{n=0}^N$ is a martingale with respect to $(\mathcal{F}_n)$.
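
One standard route from Proposition 1 to an anytime-valid stopping rule can be sketched as follows. The symbols $R_N$ (the event that the fixed-sample test rejects at $N$), $c$ (the stopping threshold) and $\alpha'$ (the level of the fixed-sample test) are introduced here for illustration and are not taken from the paper. The definition of $Q_n$ follows the caption of Figure 2, and a simple null hypothesis is assumed so that $Q_0$ is a fixed number.

$$
Q_n = \mathbb{P}_0\left(R_N \mid \mathcal{F}_n\right), \qquad
\mathbb{E}_0\left[Q_{n+1} \mid \mathcal{F}_n\right]
= \mathbb{E}_0\left[\mathbb{P}_0\left(R_N \mid \mathcal{F}_{n+1}\right) \mid \mathcal{F}_n\right]
= \mathbb{P}_0\left(R_N \mid \mathcal{F}_n\right) = Q_n ,
$$

where the middle equality is the tower property of conditional expectation. Because $0 \le Q_n \le 1$ and $Q_0 = \mathbb{P}_0(R_N) = \alpha'$, Ville's inequality for nonnegative martingales gives

$$
\mathbb{P}_0\left(\exists\, n \le N : Q_n \ge c\right) \le \frac{Q_0}{c} = \frac{\alpha'}{c}.
$$

Under these assumptions, a threshold of $c = 0.95$ combined with a fixed-sample level of $\alpha' = 0.0475$ would bound the sequential Type-I error by $0.05$.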

Figures (26)

  • Figure 1: Overview of the predictive procedure. Panel A: the classical fixed-sample test precludes early analysis. Panel B: the uncertainty in the outcome of the fixed-sample test at $n < N$ is driven by the unobserved data $X_{n+1:N}$. Panel C: we can characterise the uncertainty in the test decision at $N$, assuming the null hypothesis to be true, by simulating the missing data under the null hypothesis. Repeated testing of completed datasets estimates the probability that the corresponding fixed-sample test will reject at $N$. Panel D: The predictive test has good power and saves samples when the null hypothesis is false.
  • Figure 2: Predicting the fixed-sample decision enables anytime-valid testing. Shown is the evolution of the predicted rejection probability $Q_n$, defined as the probability that the fixed-sample test would reject at sample size $N$, conditional on data observed up to stage $n$ and assuming the null hypothesis is true. The experiment tests $H_0:\theta=0$ versus $H_1:\theta>0$ for a normal mean $\theta$ with variance assumed known and equal to 1. Here $N=500$, and the sequential test has a Type-I error of 0.05.
  • Figure 3: Distribution of early stopping times. Shown is the cumulative distribution of the early stopping times for an experiment of the same type as in Figure 2. The distribution records the sample size at which the $Q_n$ sequence crosses $0.95$ when the alternative hypothesis is true with $\theta^*=0.13$. The distribution was estimated from 10,000 simulations.
  • Figure 4: Near-optimal power with minimal sample inflation. Power as a function of effect size is shown for a classical fixed-sample test (dotted line), the predictive anytime-valid test (bold line), and a representative anytime-valid likelihood-ratio–based method (dashed line); the latter two with a 2% sample size increase. The predictive test closely tracks the power of the fixed-sample test across effect sizes, requiring only a small increase in maximal sample size, from $N=500$ for the original test to $N'=509$ for the sequential test, to recover classical power while enabling early stopping and reducing expected sample usage under alternatives. The details of the experiment are the same as with Figures 2 and 3. The power functions are computed for varying $\theta$ values via simulation with Monte Carlo sample sizes of 10,000.
  • Figure 5: Predictive anytime-valid analysis of a clinical trial. Predicted rejection probability $Q_t$ for the International Stroke Trial (IST), comparing death including censoring times for patients assigned to aspirin for the first 14 days of the trial versus not assigned to aspirin. Here the $t$ are represented by a discrete set of time points which are set over 15 months following the start of the trial. Full details of the experiment are presented in Section 10 of the Supplementary Materials. At each interim stage, $Q_t$ represents the probability that the original fixed-sample two-sample test would reject the null hypothesis at the planned sample size, conditional on the data observed to time $t$ and assuming the null hypothesis is true. The predicted rejection probability crosses the stopping threshold before the final sample size is reached, allowing early stopping while preserving the calibration and conclusion of the fixed-sample analysis.
  • ...and 21 more figures
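
The normal-mean experiment described in the captions of Figures 1 to 3 can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the fixed-sample test is assumed to be a one-sided z-test that rejects when $\sum_{i \le N} X_i / \sqrt{N}$ exceeds a normal quantile, and the fixed-sample level `alpha_fixed = 0.0475` is an assumed calibration chosen so that a threshold of 0.95 bounds the sequential Type-I error by 0.05. It computes $Q_n$ both in closed form and by simulating the missing data under the null, as in Panel C of Figure 1.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def predicted_rejection_prob(partial_sum, n, N, alpha_fixed):
    """Closed-form Q_n for H0: theta = 0 with known unit variance.

    Assumed fixed-sample test: reject when sum(X_1..X_N) >= z * sqrt(N),
    with z the (1 - alpha_fixed) standard normal quantile. Under H0 the
    sum of the N - n unobserved values is N(0, N - n).
    """
    boundary = STD_NORMAL.inv_cdf(1.0 - alpha_fixed) * math.sqrt(N)
    if n >= N:
        # All data observed: the decision is known with certainty.
        return 1.0 if partial_sum >= boundary else 0.0
    return 1.0 - STD_NORMAL.cdf((boundary - partial_sum) / math.sqrt(N - n))


def predicted_rejection_prob_mc(partial_sum, n, N, alpha_fixed, draws, rng):
    """Monte Carlo Q_n: complete the data under H0 and rerun the test."""
    boundary = STD_NORMAL.inv_cdf(1.0 - alpha_fixed) * math.sqrt(N)
    rejections = 0
    for _ in range(draws):
        # Sum of the N - n missing N(0, 1) observations, drawn in one step.
        future_sum = rng.gauss(0.0, math.sqrt(N - n))
        if partial_sum + future_sum >= boundary:
            rejections += 1
    return rejections / draws


def run_sequential_test(data, N, alpha_fixed, threshold):
    """Return (rejected, n) where n is the first stage with Q_n >= threshold.

    If the threshold is never crossed, returns (False, stages examined).
    """
    partial_sum = 0.0
    n = 0
    for n, x in enumerate(data[:N], start=1):
        partial_sum += x
        q_n = predicted_rejection_prob(partial_sum, n, N, alpha_fixed)
        if q_n >= threshold:
            return True, n
    return False, n


if __name__ == "__main__":
    rng = random.Random(0)
    # Settings taken from the figure captions: N = 500, theta* = 0.13,
    # threshold 0.95. alpha_fixed = 0.0475 is an assumption (see lead-in).
    N, alpha_fixed, threshold = 500, 0.0475, 0.95
    data = [rng.gauss(0.13, 1.0) for _ in range(N)]
    rejected, n_stop = run_sequential_test(data, N, alpha_fixed, threshold)
    print("rejected:", rejected, "at n =", n_stop)
```

The closed form is available here only because the test statistic is a sum of normal variables. For a general fixed-sample test, the Monte Carlo version is the more natural starting point, with the simulation step replaced by draws from whatever null model is appropriate.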

Theorems & Definitions (2)

  • Proposition 1
  • proof