Multi-Armed Sequential Hypothesis Testing by Betting

Ricardo J. Sandoval, Ian Waudby-Smith, Michael I. Jordan

Abstract

We consider a variant of sequential testing by betting where, at each time step, the statistician is presented with multiple data sources (arms) and obtains data by choosing one of the arms. We consider the composite global null hypothesis $\mathscr{P}$ that all arms are null in a certain sense (e.g. all dosages of a treatment are ineffective) and we are interested in rejecting $\mathscr{P}$ in favor of a composite alternative $\mathscr{Q}$ where at least one arm is non-null (e.g. there exists an effective treatment dosage). We posit an optimality desideratum that we describe informally as follows: even if several arms are non-null, we seek $e$-processes and sequential tests whose performance is as strong as that of those with oracle knowledge about which arm generates the most evidence against $\mathscr{P}$. Formally, we generalize notions of log-optimality and expected rejection time optimality to more than one arm, obtaining matching lower and upper bounds for both. A key technical device in this optimality analysis is a modified upper-confidence-bound-like algorithm for unobservable but sufficiently "estimable" rewards. In the design of this algorithm, we derive nonasymptotic concentration inequalities for optimal wealth growth rates in the sense of Kelly [1956]. These may be of independent interest.

Paper Structure (34 sections, 22 theorems, 179 equations, 3 figures, 3 algorithms)

This paper contains 34 sections, 22 theorems, 179 equations, 3 figures, 3 algorithms.

Key Result

Proposition 2.6

Let $\mathcal{P}$ be a global null hypothesis in the sense of eq:prelim-global null. Suppose that $((Y_n(1), \dots, Y_n(K)))_{n \in \mathbb{N}} \sim \mathsf P$ are i.i.d. draws from the joint distribution $\mathsf P \in \mathcal{P}$. Furthermore, suppose that $(f_n)_{n \in \mathbb{N}}$ is a sequence of [...] Then, for any $\mathcal{H}$-predictable $(A_n)_{n \in \mathbb{N}}$, the process $(M_n)_{n \in \mathbb{N}}$ [...]

Figures (3)

  • Figure 1: Empirical growth rates for the one-sided bounded mean testing problem from \ref{example:one-sided mean testing} under "easy" (left) and "hard" (right) data generating processes. We consider three algorithms in addition to \ref{algorithm:spruce}. Oracle Arm has oracle access to, and solely pulls, the optimal arm. Round Robin pulls the arms one-by-one until all of them have been selected, and then starts the process over again. Random Selection samples the arm to be played in round $n$ uniformly at random. All four algorithms employ the regret-based test statistic from \ref{algorithm:spruce} and differ only in the way the arms are selected. Lastly, we note that the empirical growth rates of Round Robin and Random Selection are close to but nevertheless strictly greater than zero.
  • Figure 2: Distribution of stopping times when $\alpha = 0.001$ for the one-sided bounded mean testing problem from \ref{example:one-sided mean testing} under "easy" (left) and "hard" (right) data generating processes. In addition to \ref{algorithm:spruce}, we evaluate three algorithms whose description can be found in the caption of \ref{fig:bounded-mean-testing}. We use the following shorthand for the log-increment under the optimal arm and its optimal portfolio: $\ell_{1,\mathsf Q}\left(a_{\mathsf Q} \right ) \coloneqq \log\left(\boldsymbol{\lambda}_\mathsf Q(a_{\mathsf Q})^\top \mathbf{E}_1\left(a_{\mathsf Q} \right ) \right )$.
  • Figure 3: Empirical growth rates (left) and distribution of stopping times when $\alpha=0.001$ (right) for the average treatment effect testing problem from \ref{eq:ate-testing-problem} for four different algorithms. Round Robin pulls the arms one-by-one until all of them have been selected, and then starts the process over again. Random Selection samples the arm to be played in round $n$ uniformly at random. All four algorithms compute their test statistic following the form given in \ref{eq:ate-test-statistic}; that is, they differ only in the way they select the arm to pull (i.e., treatment variation to test) at each time step.
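
The two non-adaptive baselines named in the captions above, Round Robin and Random Selection, can be sketched as follows. This is a minimal illustration only: the function names, the zero-based arm indexing, and the use of Python's `random` module are assumptions made for the sketch and are not taken from the paper, and the test statistic itself is not reproduced here.

```python
import random


def round_robin_arm(n: int, num_arms: int) -> int:
    """Arm pulled in round n (rounds start at 1) when cycling through arms 0..K-1.

    Hypothetical helper: arms are pulled one by one, and the cycle
    restarts once every arm has been selected.
    """
    return (n - 1) % num_arms


def random_selection_arm(num_arms: int, rng: random.Random) -> int:
    """Arm drawn uniformly at random from 0..K-1 (hypothetical helper)."""
    return rng.randrange(num_arms)


# Example with K = 3 arms over 7 rounds.
rng = random.Random(0)  # seeded only so the sketch is reproducible
rr = [round_robin_arm(n, 3) for n in range(1, 8)]
rs = [random_selection_arm(3, rng) for _ in range(7)]
```

Neither rule looks at past observations, which is the sense in which these baselines differ from an adaptive arm-selection rule.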

Theorems & Definitions (60)

  • Definition 2.1: Test supermartingales
  • Definition 2.2: $e$-processes
  • Remark 2.3
  • Definition 2.4: Single-arm asymptotic log-optimality
  • Definition 2.5: History-oracle filtration
  • Proposition 2.6: Type-I error control under multi-armed data collection
  • Remark 2.7: On the related work of hsu2025active
  • Remark 2.8: On the related work of imbens2026demonstration
  • Definition 2.9: The oracle-history comparator class of $\mathcal{P}\text{-}e$-processes
  • Example 2.10: Two-sided bounded mean testing
  • ...and 50 more