Regret Distribution in Stochastic Bandits: Optimal Trade-off between Expectation and Tail Risk

David Simchi-Levi, Zeyu Zheng, Feng Zhu

TL;DR

This work addresses how to balance regret expectation with tail risk in stochastic multi-armed bandits, revealing a fundamental trade-off between worst-case optimality, instance-dependent consistency, and light-tailed risk. It introduces new UCB-based policies with two-part bonus terms that achieve optimal tail decay under both known and unknown horizons, and shows a horizon-knowledge gap only in the instance-dependent regime. The authors extend the framework to sub-exponential environments and linear bandits, maintaining safe, robust tail behavior, and uncover a conceptual link to AlphaGo's Monte Carlo Tree Search strategy. These findings provide actionable guidance for designing exploration-exploitation policies that are both efficient and safe in uncertain environments, with implications for broader reinforcement learning contexts.

Abstract

We study the optimal trade-off between expectation and tail risk for regret distribution in the stochastic multi-armed bandit model. We fully characterize the interplay among three desired properties for policy design: worst-case optimality, instance-dependent consistency, and light-tailed risk. New policies are proposed to characterize the optimal regret tail probability for any regret threshold. In particular, we discover an intrinsic gap in the optimal tail rate depending on whether the time horizon $T$ is known a priori or not. Interestingly, in the purely worst-case scenario, this gap disappears. Our results reveal insights on how to design policies that balance efficiency and safety, and highlight extra insights on policy robustness with regard to policy hyper-parameters and model mis-specification. We also conduct a simulation study to validate our theoretical insights and provide practical amendments to our policies. Finally, we discuss extensions of our results to (i) general sub-exponential environments and (ii) general stochastic linear bandits. Furthermore, we find that a special case of our policy design surprisingly coincides with what was adopted in AlphaGo Monte Carlo Tree Search. Our theory provides high-level insights into why their engineered solution is successful and should be advocated in complex decision-making environments.
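
To make the two quantities contrasted in the abstract concrete, the sketch below estimates both the expected regret and an empirical regret tail probability by Monte Carlo simulation on the two-armed Gaussian instance $(\mathcal{N}(0.1, 1), \mathcal{N}(-0.1, 1))$ listed under Figure 1. It is not the paper's policy: the two-part bonus is not specified in this summary, so the classical UCB1 index stands in as a placeholder. The function names `run_ucb1` and `regret_summary`, the horizon, the number of runs, and the threshold are all illustrative assumptions.

```python
import math
import random


def run_ucb1(means, horizon, rng):
    """Run classical UCB1 once on unit-variance Gaussian arms.

    Assumption: UCB1 is a stand-in here, not the policy proposed in the paper.
    Returns the pseudo-regret accumulated over `horizon` rounds.
    """
    k = len(means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # sum of observed rewards per arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # pull every arm once to initialize
        else:
            # empirical mean plus the standard UCB1 exploration bonus
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret


def regret_summary(means, horizon, runs, threshold, seed=0):
    """Estimate E[regret] and P(regret >= threshold) over independent runs."""
    rng = random.Random(seed)
    regrets = [run_ucb1(means, horizon, rng) for _ in range(runs)]
    mean_regret = sum(regrets) / runs
    tail_prob = sum(r >= threshold for r in regrets) / runs
    return mean_regret, tail_prob
```

Calling, for example, `regret_summary([0.1, -0.1], horizon=500, runs=200, threshold=30.0)` returns a pair (estimated expected regret, estimated tail probability). A policy can look good on the first number while still placing noticeable mass above the threshold, which is the tension the abstract describes.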

Paper Structure

This paper contains 18 sections, 17 theorems, 173 equations, 2 figures, 1 table.

Key Result

Lemma 1

We have $\mathbb E[N^\pi(T)] = 0$ and

Figures (2)

  • Figure 1: regret expectation vs. tail risk for $(\mathcal{N}(0.1, 1), \mathcal{N}(-0.1, 1))$
  • Figure 2: MCTS procedure of one simulation in AlphaGo

Theorems & Definitions (17)

  • Lemma 1
  • Theorem 1
  • Lemma 2
  • Proposition 1
  • Corollary 1
  • Theorem 2
  • Corollary 2
  • Proposition 2
  • Theorem 3
  • Corollary 3
  • ...and 7 more