Early Stopping in Contextual Bandits and Inferences

Zihan Cui

TL;DR

This work tackles efficient data collection in linear contextual bandits by developing early stopping rules that balance in-experiment regret against sampling costs. It combines pre-determined stopping based on tail bounds of online OLS estimators with online variance-based stopping, leveraging batched data for stability and tractable inference. The authors establish regret bounds, propose an inverse-variance weighted online estimator, and provide a conditional inference framework that accounts for the realized stopping time, including a Gibbs-sampling approach for robust post-stop conclusions. The methods offer principled stopping criteria and valid post-experiment inference, with practical relevance for sequential decision-making in domains like clinical trials, online advertising, and recommendations.

Abstract

Bandit algorithms sequentially accumulate data using adaptive sampling policies, offering flexibility for real-world applications. However, excessive sampling can be costly, motivating the development of early stopping methods and reliable post-experiment conditional inference. This paper studies early stopping methods in linear contextual bandits, including both pre-determined and online stopping rules, to minimize in-experiment regret while accounting for sampling costs. We propose stopping rules based on the Opportunity Cost and Threshold Method, using the variances of unbiased or consistent online estimators to quantify upper bounds on the regret of the learned optimal policy. The study focuses on batched settings for stability, selecting a weighted combination of batched estimators as the online estimator and deriving its asymptotic distribution. Online statistical inference is performed based on the selected estimator, conditional on the realized stopping time. Our proposed method provides a systematic approach to minimizing in-experiment regret and conducting robust post-experiment inference, facilitating decision-making in future applications.
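As a rough illustration of combining batched estimators by inverse-variance weighting (the weighting named in the TL;DR), the sketch below handles a single scalar coefficient. The function name `inverse_variance_combine` and the treatment of batches as independent with known variances are assumptions made for this example; the paper's exact weights and its asymptotic argument are not given in this summary.

```python
def inverse_variance_combine(estimates, variances):
    """Combine per-batch scalar estimates with weights proportional to 1/variance.

    Hypothetical helper: assumes each batch yields one estimate of the same
    coefficient together with a strictly positive variance, and that batches
    can be treated as independent when computing the combined variance.
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need one variance per estimate and at least one batch")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be strictly positive")

    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)

    # Normalized weights: more precise batches contribute more.
    weights = [p / total_precision for p in precisions]
    combined = sum(w * e for w, e in zip(weights, estimates))

    # Under the independence assumption, the combined variance is the
    # reciprocal of the summed precisions.
    combined_variance = 1.0 / total_precision
    return combined, combined_variance


if __name__ == "__main__":
    # Two batches; the second is three times noisier, so it gets weight 1/4.
    est, var = inverse_variance_combine([1.0, 3.0], [1.0, 3.0])
    print(est, var)
```

The combined variance is never larger than the smallest batch variance, which is why a variance-based stopping rule can be evaluated on the combined estimator as batches arrive.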

Paper Structure

This paper contains 13 sections, 11 theorems, 48 equations.

Key Result

Theorem 1

If $\|\hat{\beta}_{t,1}-\beta_1\|\leq B_t$ and $\|\hat{\beta}_{t,0}-\beta_0\|\leq B_t$ hold with probability at least $1-\delta$, then $R_{\pi_t^*}\leq (2B_tL)^{1+\lambda}M$ with probability at least $1-\delta$.
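To make the bound in Theorem 1 concrete, the sketch below evaluates $(2B_tL)^{1+\lambda}M$ and uses it in a simple threshold check. The constants $L$, $\lambda$, and $M$ are not defined in this summary, so they are treated here only as nonnegative inputs. The function names, the sequence of radii $B_t$, and the threshold are hypothetical values chosen for illustration, not quantities taken from the paper.

```python
def regret_upper_bound(b_t, L, lam, M):
    """Evaluate the Theorem 1 bound (2 * B_t * L) ** (1 + lambda) * M.

    All arguments are assumed nonnegative; their precise meaning comes from
    the paper's assumptions, which this summary does not spell out.
    """
    if min(b_t, L, lam, M) < 0:
        raise ValueError("b_t, L, lam and M must be nonnegative")
    return (2.0 * b_t * L) ** (1.0 + lam) * M


def first_stopping_time(radii, L, lam, M, threshold):
    """Return the first t (1-indexed) at which the bound drops to the threshold.

    `radii` is a hypothetical sequence B_1, B_2, ... of estimation-error radii.
    Returns None if the bound never reaches the threshold.
    """
    for t, b_t in enumerate(radii, start=1):
        if regret_upper_bound(b_t, L, lam, M) <= threshold:
            return t
    return None


if __name__ == "__main__":
    # Illustrative radii that shrink as more data is collected.
    radii = [0.5, 0.2, 0.05]
    print(first_stopping_time(radii, L=1.0, lam=1.0, M=2.0, threshold=0.05))
```

Because the bound is increasing in $B_t$, any rule that tightens the estimation radius over time makes the threshold check eventually succeed, provided the threshold is positive and $B_t$ shrinks toward zero.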

Theorems & Definitions (17)

  • Theorem 1
  • proof
  • Lemma 1
  • Corollary 1
  • Corollary 2
  • Lemma 2
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • ...and 7 more