Peeking with PEAK: Sequential, Nonparametric Composite Hypothesis Tests for Means of Multiple Data Streams
Brian Cho, Kyra Gan, Nathan Kallus
TL;DR
PEAK delivers a nonparametric, sequential testing framework for composite mean hypotheses across multiple data streams by leveraging $e$-processes and a novel averaging scheme that avoids union bounds. It generalizes a single-stream betting rule to the multi-stream setting, yielding type-I error control and power-one guarantees under mild sampling assumptions, while maintaining computational tractability via convex optimization for region-based hypotheses. The approach supports threshold identification (THR) and best-arm identification (BAI) as convex-region examples and reports up to an $85\%$ reduction in samples before stopping relative to existing stopping rules for pure-exploration bandit problems, with favorable runtimes in experiments on synthetic and real-world HeartSteps data. These contributions offer a robust, anytime-valid alternative to parametric sequential tests for adaptive experiments in healthcare and related domains.
Abstract
We propose a novel nonparametric sequential test for composite hypotheses for means of multiple data streams. Our proposed method, \emph{peeking with expectation-based averaged capital} (PEAK), builds upon the testing-by-betting framework and provides a non-asymptotic $\alpha$-level test that is valid at any stopping time. Our contributions are two-fold: (1) we propose a novel betting scheme and provide theoretical guarantees on type-I error control, power, and asymptotic growth rate/$e$-power in the setting of a single data stream; (2) we introduce PEAK, a generalization of this betting scheme to multiple streams, that (i) avoids wasteful union bounds via averaging, (ii) is a test of power one under mild regularity conditions on the sampling scheme of the streams, and (iii) reduces computational overhead when applying testing-by-betting approaches to pure-exploration bandit problems. We illustrate the practical benefits of PEAK using both synthetic and real-world HeartSteps datasets. Our experiments show that PEAK provides up to an 85\% reduction in the number of samples before stopping compared to existing stopping rules for pure-exploration bandit problems, and matches the performance of state-of-the-art sequential tests while improving upon computational complexity.
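For readers unfamiliar with testing by betting, the following is a minimal sketch of the generic idea on a single stream, not the PEAK betting scheme itself. It assumes observations lie in $[0,1]$, tests the point null that the mean equals `m0`, and uses a fixed bet `lam` (the function name, the fixed bet, and the one-sided direction are all illustrative assumptions). The wealth process is a nonnegative martingale under the null, so by Ville's inequality rejecting once wealth reaches $1/\alpha$ keeps the type-I error at most $\alpha$ at any stopping time.

```python
def betting_test(xs, m0, alpha=0.05, lam=0.5):
    """Generic testing-by-betting sketch for H0: mean == m0, data in [0, 1].

    Returns the 1-based index at which H0 is rejected, or None if the
    wealth never reaches 1/alpha. `lam` is a fixed, hypothetical bet size;
    a positive value bets that the true mean exceeds m0.
    """
    if not 0.0 < m0 < 1.0:
        raise ValueError("m0 must lie strictly between 0 and 1")
    # The bet must keep every multiplicative factor nonnegative for x in [0, 1].
    if not -1.0 / (1.0 - m0) <= lam <= 1.0 / m0:
        raise ValueError("lam would allow negative wealth")

    wealth = 1.0  # initial capital
    for t, x in enumerate(xs, start=1):
        # Under H0 each factor has expectation 1, so wealth is a martingale.
        wealth *= 1.0 + lam * (x - m0)
        if wealth >= 1.0 / alpha:
            return t  # reject H0 at time t
    return None


# Usage: a stream far above m0 is rejected quickly; a stream at m0 never is.
print(betting_test([1.0] * 100, m0=0.5))
print(betting_test([0.5] * 100, m0=0.5))
```

A fixed bet is the simplest choice; adaptive bets that depend only on past observations preserve the same validity argument and typically stop sooner.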
