Multi-Armed Sequential Hypothesis Testing by Betting

Ricardo J. Sandoval; Ian Waudby-Smith; Michael I. Jordan

Abstract

We consider a variant of sequential testing by betting where, at each time step, the statistician is presented with multiple data sources (arms) and obtains data by choosing one of the arms. We consider the composite global null hypothesis $\mathscr{P}$ that all arms are null in a certain sense (e.g. all dosages of a treatment are ineffective), and we are interested in rejecting $\mathscr{P}$ in favor of a composite alternative $\mathscr{Q}$ where at least one arm is non-null (e.g. there exists an effective treatment dosage). We posit an optimality desideratum that we describe informally as follows: even if several arms are non-null, we seek $e$-processes and sequential tests whose performance is as strong as that of procedures with oracle knowledge of which arm generates the most evidence against $\mathscr{P}$. Formally, we generalize notions of log-optimality and expected rejection time optimality to more than one arm, obtaining matching lower and upper bounds for both. A key technical device in this optimality analysis is a modified upper-confidence-bound-like algorithm for unobservable but sufficiently "estimable" rewards. In the design of this algorithm, we derive nonasymptotic concentration inequalities for optimal wealth growth rates in the sense of Kelly [1956]. These may be of independent interest.
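As a rough illustration of the testing-by-betting setup described in the abstract, the following sketch runs a one-sided bounded-mean test over several arms. It is not the paper's SPRUCE algorithm: the function name `betting_test`, the fixed bet size `lam`, the null mean `m0`, and the selector signature `choose_arm(n, num_arms)` are all assumptions made for this example. The only facts it relies on are that a nonnegative wealth process with conditional expected growth at most one under the null is a supermartingale, and that Ville's inequality then bounds the probability of ever reaching `1 / alpha` by `alpha`.

```python
import random


def betting_test(draw, choose_arm, num_arms, m0=0.5, lam=0.5,
                 alpha=0.05, max_rounds=10_000):
    """Hypothetical sketch of a multi-armed one-sided bounded-mean test.

    draw(arm) returns an observation in [0, 1]. choose_arm(n, num_arms)
    picks the arm for round n (1-indexed) and must not peek at the
    current round's outcome. Under the global null (every arm has mean
    <= m0) the wealth is a nonnegative supermartingale, so by Ville's
    inequality it reaches 1 / alpha with probability at most alpha.
    """
    # lam in [0, 1/m0] keeps every wealth factor nonnegative.
    assert 0.0 < m0 < 1.0 and 0.0 <= lam <= 1.0 / m0
    wealth = 1.0
    for n in range(1, max_rounds + 1):
        arm = choose_arm(n, num_arms)
        x = draw(arm)
        wealth *= 1.0 + lam * (x - m0)  # bet that the mean exceeds m0
        if wealth >= 1.0 / alpha:
            return n, wealth  # reject the global null at round n
    return None, wealth  # never rejected within max_rounds


if __name__ == "__main__":
    rng = random.Random(0)
    means = [0.5, 0.5, 0.7]  # assumed example: only the last arm is non-null

    def draw(arm):
        return 1.0 if rng.random() < means[arm] else 0.0

    def round_robin(n, num_arms):
        return (n - 1) % num_arms

    print(betting_test(draw, round_robin, num_arms=len(means)))
```

A fixed bet size is used here only to keep the sketch short; the abstract's log-optimality criteria concern how quickly the wealth grows, which depends on choosing both the bets and the arms well.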

Paper Structure (34 sections, 22 theorems, 179 equations, 3 figures, 3 algorithms)

Introduction
Notation.
Preliminaries
Sequential hypothesis testing by betting
Log-optimality of e-processes
The multi-armed data collection protocol
Global null hypothesis testing
A nonparametric class of test supermartingales and e-processes
Multi-Armed Log-Optimality
Portfolio and allocation regret
Achieving multi-armed log-optimality via SPRUCE
Analyzing the Expected Time to Rejection
Proof Ingredients for Sublinear Allocation Regret
Testing for the Existence of a Treatment Effect
Conclusions
...and 19 more sections

Key Result

Proposition 2.6

Let $\mathcal{P}$ be a global null hypothesis in the sense of eq:prelim-global null. Suppose that $((Y_n(1), \dots, Y_n(K)))_{n \in \mathbb{N}} \sim \mathsf P$ are i.i.d. draws from the joint distribution $\mathsf P \in \mathcal{P}$. Furthermore, suppose that $(f_n)_{n \in \mathbb{N}}$ is a sequence of [...]. Then, for any $\mathcal{H}$-predictable $(A_n)_{n \in \mathbb{N}}$, the process $(M_n)_{n \in \mathbb{N}}$ [...]

Figures (3)

Figure 1: Empirical growth rates for the one-sided bounded mean testing problem from \ref{['example:one-sided mean testing']} under "easy" (left) and "hard" (right) data generating processes. We consider three algorithms in addition to \ref{['algorithm:spruce']}. Oracle Arm has oracle access to and solely pulls the optimal arm. Round Robin pulls the arms one-by-one until all of them have been selected, and starts the process over again. Random Selection samples uniformly at random the arm to be played in round $n$. All four algorithms employ the regret-based test statistic from \ref{['algorithm:spruce']} and only differ in the way the arms are selected. Lastly, we note that the empirical growth rates of Round Robin and Random Selection are close to but nevertheless strictly greater than zero.
Figure 2: Distribution of stopping times when $\alpha = 0.001$ for the one-sided bounded mean testing problem from \ref{['example:one-sided mean testing']} under "easy" (left) and "hard" (right) data generating processes. In addition to \ref{['algorithm:spruce']}, we evaluate three algorithms whose descriptions can be found in the caption of \ref{['fig:bounded-mean-testing']}. We use the following shorthand for the log-increment under the optimal arm and its optimal portfolio: $\ell_{1,\mathsf Q}\left(a_{\mathsf Q} \right ) \coloneqq \log\left(\boldsymbol{\lambda}_\mathsf Q(a_{\mathsf Q})^\top \mathbf{E}_1\left(a_{\mathsf Q} \right ) \right )$.
Figure 3: Empirical growth rates (left) and distribution of stopping times when $\alpha=0.001$ (right) for the average treatment effect testing problem from \ref{['eq:ate-testing-problem']} for four different algorithms. Round Robin pulls the arms one-by-one until all of them have been selected, and starts the process over again. Random Selection samples uniformly at random the arm to be played in round $n$. All four algorithms compute their test statistic following the form given in \ref{['eq:ate-test-statistic']}; that is, they only differ in the way they select the arm to pull (i.e., treatment variation to test) at each time step.
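
The three baseline arm-selection rules named in the captions above (Round Robin, Random Selection, Oracle Arm) can be sketched as follows. The function names, the 1-indexed round counter `n`, the 0-indexed arms, and the `(n, num_arms)` signature are assumptions made for this illustration, not the paper's code; the selection rule of SPRUCE itself is not reproduced here.

```python
import random


def round_robin(n, num_arms):
    """Pull arms 0, 1, ..., num_arms - 1 in order, then start over."""
    return (n - 1) % num_arms


def random_selection(n, num_arms, rng=random):
    """Pull an arm chosen uniformly at random in round n."""
    return rng.randrange(num_arms)


def make_oracle_arm(best_arm):
    """Build a selector that always pulls one fixed arm.

    best_arm is assumed to be handed over by an oracle; in practice the
    optimal arm is unknown, which is why this rule is only a benchmark.
    """
    def oracle_arm(n, num_arms):
        return best_arm
    return oracle_arm
```

Because each rule uses only the round index (and fresh randomness), the chosen arm never depends on the outcome of the current round, which is the kind of predictability that Proposition 2.6 asks of $(A_n)_{n \in \mathbb{N}}$.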

Theorems & Definitions (60)

Definition 2.1: Test supermartingales
Definition 2.2: $e$-processes
Remark 2.3
Definition 2.4: Single-arm asymptotic log-optimality
Definition 2.5: History-oracle filtration
Proposition 2.6: Type-I error control under multi-armed data collection
Remark 2.7: On the related work of hsu2025active
Remark 2.8: On the related work of imbens2026demonstration
Definition 2.9: The oracle-history comparator class of $\mathcal{P}\text{-}e$-processes
Example 2.10: Two-sided bounded mean testing
...and 50 more
