Multi-Armed Bandits with Abstention

Junwen Yang, Tianyuan Jin, Vincent Y. F. Tan

TL;DR

This work extends the canonical multi-armed bandit problem by introducing an abstention option that can be taken before observing the stochastic reward. It analyzes two regret paradigms, fixed-regret and fixed-reward, and develops algorithms that achieve both asymptotic and minimax optimality in each setting: FRG-TSwA for fixed-regret and FRW-ALGwA for fixed-reward. The main results provide tight instance-dependent and worst-case regret bounds, including $\limsup_{T\to\infty} \frac{R^{\mathrm{RG}}_{\mu,c}(T)}{\log T} \le 2\sum_{i>1} \frac{\Delta_i \wedge c}{\Delta_i^2}$ and $\limsup_{T\to\infty} \frac{R^{\mathrm{RW}}_{\mu,c}(T)}{\log T} \le 2\sum_{i>1} \frac{(\mu_1 \vee c)-(\mu_i \vee c)}{\Delta_i^2}$, along with minimax bounds of order $O(\sqrt{KT})$ and phase-transition behavior in the fixed-regret setting. The paper also shows how abstention can be leveraged to reduce exploration costs and verifies the theory with numerical experiments. This work lays a foundation for online decision problems with abstention and points to future work on richer abstention dynamics such as linear bandits.
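To make the two instance-dependent bounds above concrete, the following minimal sketch evaluates their leading constants for a given vector of arm means. It assumes the usual conventions, which this summary does not spell out: $\Delta_i = \mu_1 - \mu_i$ is the gap to the best arm, $\wedge$ denotes the minimum, and $\vee$ the maximum. The function names are hypothetical and chosen for illustration only.

```python
def fixed_regret_constant(mu, c):
    """Leading constant 2 * sum_{i>1} min(Delta_i, c) / Delta_i^2.

    Assumption: Delta_i is the gap between the largest mean and mu[i];
    arms tied with the best arm (zero gap) are skipped.
    """
    best = max(mu)
    total = 0.0
    for m in mu:
        gap = best - m
        if gap > 0:
            total += min(gap, c) / gap ** 2
    return 2.0 * total


def fixed_reward_constant(mu, c):
    """Leading constant 2 * sum_{i>1} (max(mu_1, c) - max(mu_i, c)) / Delta_i^2.

    Same gap convention as above (an assumption of this sketch).
    """
    best = max(mu)
    total = 0.0
    for m in mu:
        gap = best - m
        if gap > 0:
            total += (max(best, c) - max(m, c)) / gap ** 2
    return 2.0 * total


if __name__ == "__main__":
    mu = [0.5, 0.4, 0.0]  # hypothetical instance
    print(fixed_regret_constant(mu, c=0.1))
    print(fixed_regret_constant(mu, c=float("inf")))  # abstention never helps
    print(fixed_reward_constant(mu, c=0.45))
```

On this hypothetical instance the fixed-regret constant is roughly 20.8 with $c = 0.1$, against roughly 24 when $c$ is infinite, because the large gap $\Delta_3 = 0.5$ is capped at $c$. In the fixed-reward case the constant vanishes whenever $c \ge \mu_1$, since every term in the sum is then zero.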

Abstract

We introduce a novel extension of the canonical multi-armed bandit problem that incorporates an additional strategic element: abstention. In this enhanced framework, the agent is not only tasked with selecting an arm at each time step, but also has the option to abstain from accepting the stochastic instantaneous reward before observing it. When opting for abstention, the agent either suffers a fixed regret or gains a guaranteed reward. Given this added layer of complexity, we ask whether we can develop efficient algorithms that are both asymptotically and minimax optimal. We answer this question affirmatively by designing and analyzing algorithms whose regrets meet their corresponding information-theoretic lower bounds. Our results offer valuable quantitative insights into the benefits of the abstention option, laying the groundwork for further exploration in other online decision-making problems with such an option. Numerical results further corroborate our theoretical findings.
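The interaction protocol described above can be sketched as a short simulation loop. This is an illustrative sketch under stated assumptions, not the authors' implementation: rewards are taken to be Gaussian with unit variance, the outcome is still revealed to the agent after it abstains, and regret is accumulated in expectation (pseudo-regret). The names `run_protocol` and `make_round_robin` are hypothetical, and the round-robin policy is a placeholder rather than FRG-TSwA or FRW-ALGwA.

```python
import random


def run_protocol(mu, c, horizon, policy, mode="fixed_regret", seed=0):
    """Simulate the abstention protocol and return cumulative pseudo-regret.

    At each step the policy returns (arm, abstain) BEFORE the reward is drawn.
    Assumptions of this sketch: unit-variance Gaussian rewards, and the
    outcome is appended to the history even when the agent abstains.
    """
    rng = random.Random(seed)
    best = max(mu)
    history = []
    regret = 0.0
    for t in range(horizon):
        arm, abstain = policy(t, history)
        reward = rng.gauss(mu[arm], 1.0)
        if mode == "fixed_regret":
            # Abstaining costs a fixed regret c; otherwise the usual gap.
            regret += c if abstain else best - mu[arm]
        else:
            # Fixed-reward: abstaining yields a guaranteed reward c, so the
            # benchmark per step is max(mu_1, c).
            regret += max(best, c) - (c if abstain else mu[arm])
        history.append((arm, abstain, reward))
    return regret


def make_round_robin(n_arms, abstain):
    """Placeholder policy: cycle through the arms with a fixed abstention choice."""
    def policy(t, history):
        return t % n_arms, abstain
    return policy


if __name__ == "__main__":
    mu = [0.5, 0.4, 0.0]  # hypothetical instance
    print(run_protocol(mu, 0.1, 300, make_round_robin(3, abstain=True)))
    print(run_protocol(mu, 0.1, 300, make_round_robin(3, abstain=False)))
```

With these placeholder policies, always abstaining accrues regret $cT$ in the fixed-regret mode, while never abstaining accrues the sum of the gaps of the arms pulled. A learning algorithm would instead choose the abstention flag from the history.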

Paper Structure

This paper contains 38 sections, 11 theorems, 101 equations, 7 figures, 2 algorithms.

Key Result

Theorem 3.3

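For all abstention regrets $c>0$ and bandit instances $\mu\in\mathcal{U}$, Algorithm 1 (FRG-TSwA) guarantees that $\limsup_{T\to\infty} \frac{R^{\mathrm{RG}}_{\mu,c}(T)}{\log T} \le 2\sum_{i>1} \frac{\Delta_i \wedge c}{\Delta_i^2}$. Furthermore, there exists a universal constant $\alpha > 0$ such that [...]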

Figures (7)

  • Figure 1: Interaction protocol for multi-armed bandits with fixed-regret and fixed-reward abstention.
  • Figure 2: Empirical regrets with abstention regret $c=0.1$ for different time horizons $T$.
  • Figure 3: Empirical regrets with time horizon $T=10,000$ for different abstention regrets $c$.
  • Figure 4: Empirical regrets with abstention reward $c=0.9$ for different time horizons $T$.
  • Figure 5: Empirical regrets with time horizon $T=10,000$ for different abstention rewards $c$.
  • ...and 2 more figures

Theorems & Definitions (29)

  • Remark 2.1
  • Remark 2.2
  • Remark 3.1
  • Remark 3.2
  • Theorem 3.3
  • Definition 3.4: $R^{\mathrm{RG}}$-consistency
  • Theorem 3.5
  • Remark 3.6
  • Theorem 4.1
  • Remark 4.2
  • ...and 19 more