A Statistically Reliable Optimization Framework for Bandit Experiments in Scientific Discovery

Tong Li; Travis Mandel; Goldie Phillips; Anna Rafferty; Eric M. Schwartz; Dehan Kong; Joseph J. Williams

Abstract

Scientific experimentation is largely driven by statistical hypothesis testing to determine significant differences between interventions. Traditionally, experimenters allocate samples uniformly across interventions. However, such an approach may lead to suboptimal outcomes. Multi-armed bandits (MABs) address this problem by allocating samples adaptively to maximize outcomes. Yet two challenges have hindered the use of MABs in scientific domains. First, common hypothesis tests (e.g., $t$-tests) become invalid under adaptive sampling without correction, leading to inflated type I and type II errors. This is an understudied problem, and prior solutions suffer from issues such as low statistical power, which prevent adoption in many practical settings. Second, practitioners must explicitly balance cumulative reward with statistical efficiency, yet no general methodology exists to quantify this trade-off across algorithms. In this paper, we study assumption modification and critical region correction approaches for hypothesis testing that enable common tests to be applied to adaptively collected data. We provide heuristic justification for the power efficiency of our correction and show in simulation that it achieves higher power than existing approaches. Further, we derive a theoretically and practically motivated objective function for adaptive experiment evaluation, which we integrate into a unified experimental framework. Our framework asks experimenters to specify an experiment extension cost for their problem; based on that cost, our proposed optimization procedure selects the bandit algorithm that best balances reward and power in their setting. We show that our approach enables practitioners to improve outcomes with only slightly more steps than uniform randomization, while retaining statistical validity.
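The claim that uncorrected tests lose validity under adaptive sampling can be probed with a small simulation. The sketch below is illustrative only and is not the paper's procedure: it assumes two Bernoulli arms with identical success probability 0.5 (so the null hypothesis is true), Thompson sampling with Beta(1, 1) priors, and an uncorrected two-sided Wald z-test standing in for the $t$-test. All function names are hypothetical.

```python
import math
import random


def wald_rejects(s0, n0, s1, n1, z_crit=1.96):
    """Uncorrected two-sided Wald z-test for a difference in Bernoulli means."""
    if n0 == 0 or n1 == 0:
        return False  # one arm was never sampled; no test possible
    p0, p1 = s0 / n0, s1 / n1
    se = math.sqrt(p0 * (1 - p0) / n0 + p1 * (1 - p1) / n1)
    if se == 0.0:
        return False  # degenerate sample; treated as "do not reject"
    return abs(p1 - p0) / se > z_crit


def run_experiment(policy, horizon, p, rng):
    """Run one two-armed experiment; return (succ0, pulls0, succ1, pulls1)."""
    succ, pulls = [0, 0], [0, 0]
    for _ in range(horizon):
        if policy == "ts":
            # Thompson sampling: draw from each Beta posterior, play the larger draw.
            draws = [rng.betavariate(1 + succ[a], 1 + pulls[a] - succ[a])
                     for a in (0, 1)]
            arm = 0 if draws[0] > draws[1] else 1
        else:
            # Uniform randomization (UR): each arm with probability 1/2.
            arm = rng.randrange(2)
        reward = 1 if rng.random() < p[arm] else 0
        succ[arm] += reward
        pulls[arm] += 1
    return succ[0], pulls[0], succ[1], pulls[1]


def false_positive_rate(policy, reps=2000, horizon=100, seed=0):
    """Fraction of null experiments (both arms p = 0.5) in which the test rejects."""
    rng = random.Random(seed)
    rejections = sum(
        wald_rejects(*run_experiment(policy, horizon, (0.5, 0.5), rng))
        for _ in range(reps)
    )
    return rejections / reps


if __name__ == "__main__":
    print("UR false positive rate:", false_positive_rate("ur"))
    print("TS false positive rate:", false_positive_rate("ts"))
```

Under uniform randomization the estimated rate should sit near the nominal 0.05. Under Thompson sampling the same test would typically be expected to reject more often, which is the kind of inflation that a critical region correction is meant to remove. The exact figures depend on the horizon, the priors, and the seed.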

Paper Structure (46 sections, 2 theorems, 24 equations, 2 figures, 6 tables)

Introduction
Problem Setup
Basic MAB setup
Hypothesis testing
Power Analysis
Our problems
Related Work
Hypothesis Testing with Adaptively Collected Data
Balancing Reward and Inference Objectives
Problem 1: Adjusting Efficiently for Hypothesis Testing Confidence
Our proposed test correction method
Most powerful test for simple hypotheses.
Proof sketch.
Evaluation on t-tests.
Problem 2: Trading Off Reward and Statistical Power
...and 31 more sections

Key Result

Theorem 4.1

Let $\pi$ be an arbitrary MAB algorithm. For testing simple hypotheses $\vec{\nu}_0$ against $\vec{\nu}_1$ using data collected under $\pi$, the test whose critical region is constructed from the classical LRT is the most powerful test at level

Figures (2)

Figure 1: Screenshot of our optimization framework web application, showing the relative ECP-reward performance for the simulation inspired by the empirical study. Note that the best setting for $\epsilon$-TS outperforms TS and UR near $w=0.01$.
Figure 2: Screenshot of our optimization framework web application user input page.

Theorems & Definitions (2)

Theorem 4.1: AIT-optimality of the LRT on simple hypotheses
Lemma A.1: Algorithm Invariance of the Likelihood Ratio
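The standard argument behind an invariance result of this kind can be sketched as follows. The notation is assumed for illustration and is not taken from the paper: $H_t$ denotes the history up to step $t$, $a_t$ the arm chosen at step $t$, $r_t$ the observed reward, and $p_{\nu}(\cdot)$ the reward density of an arm with distribution $\nu$. The policy $\pi$ chooses $a_t$ using only $H_{t-1}$, so its contribution to the likelihood does not depend on the arm distributions and cancels in the ratio.

```latex
% Likelihood of a full history H_T collected under policy \pi
% when the arm distributions are \vec{\nu} = (\nu_1, \dots, \nu_K):
\[
  L_\pi(\vec{\nu}; H_T)
  = \prod_{t=1}^{T} \pi(a_t \mid H_{t-1}) \, p_{\nu_{a_t}}(r_t).
\]
% The factors \pi(a_t \mid H_{t-1}) are free of \vec{\nu}, so they cancel:
\[
  \frac{L_\pi(\vec{\nu}_1; H_T)}{L_\pi(\vec{\nu}_0; H_T)}
  = \prod_{t=1}^{T}
    \frac{p_{\nu_{1,a_t}}(r_t)}{p_{\nu_{0,a_t}}(r_t)}.
\]
```

As a function of the observed data, the ratio is therefore the same expression for every $\pi$. Its sampling distribution under $\vec{\nu}_0$ still depends on $\pi$, because the policy determines which arms are observed, so the rejection threshold that attains a given level has to be calibrated for the algorithm in use.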
