Regret Distribution in Stochastic Bandits: Optimal Trade-off between Expectation and Tail Risk

David Simchi-Levi; Zeyu Zheng; Feng Zhu

TL;DR

This work addresses how to balance regret expectation with tail risk in stochastic multi-armed bandits, revealing a fundamental trade-off between worst-case optimality, instance-dependent consistency, and light-tailed risk. It introduces new UCB-based policies with two-part bonus terms that achieve optimal tail decay under both known and unknown horizons, and shows a horizon-knowledge gap only in the instance-dependent regime. The authors extend the framework to sub-exponential environments and linear bandits, maintaining safe, robust tail behavior, and uncover a conceptual link to AlphaGo's Monte Carlo Tree Search strategy. These findings provide actionable guidance for designing exploration-exploitation policies that are both efficient and safe in uncertain environments, with implications for broader reinforcement learning contexts.

Abstract

We study the optimal trade-off between expectation and tail risk for the regret distribution in the stochastic multi-armed bandit model. We fully characterize the interplay among three desired properties for policy design: worst-case optimality, instance-dependent consistency, and light-tailed risk. New policies are proposed to characterize the optimal regret tail probability for any regret threshold. In particular, we discover an intrinsic gap in the optimal tail rate depending on whether the time horizon $T$ is known a priori or not. Interestingly, in the purely worst-case scenario, this gap disappears. Our results reveal how to design policies that balance efficiency and safety, and offer further insights into policy robustness with regard to policy hyper-parameters and model mis-specification. We also conduct a simulation study to validate our theoretical insights and provide practical amendments to our policies. Finally, we discuss extensions of our results to (i) general sub-exponential environments and (ii) general stochastic linear bandits. Furthermore, we find that a special case of our policy design surprisingly coincides with what was adopted in AlphaGo Monte Carlo Tree Search. Our theory provides high-level insights into why their engineered solution is successful and should be advocated in complex decision-making environments.
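The two quantities the abstract contrasts, regret expectation and regret tail probability, can both be estimated by Monte Carlo simulation. The sketch below is an illustration only and is not the authors' policy: it runs a textbook UCB index (bonus $\sqrt{2\log t / n}$) on the two-armed Gaussian instance $(\mathcal{N}(0.1, 1), \mathcal{N}(-0.1, 1))$ named in Figure 1, and reports the empirical mean regret together with the fraction of runs whose regret reaches a threshold. The function names, the `bonus_scale` parameter, the horizon, the run count, the threshold, and the use of pseudo-regret (mean gaps rather than realized rewards) are all assumptions made for this example.

```python
import math
import random


def run_ucb(means, horizon, rng, bonus_scale=2.0):
    """One run of a textbook UCB index policy on unit-variance Gaussian arms.

    Returns the pseudo-regret: the sum over rounds of
    (best mean - mean of the pulled arm). This is NOT the paper's policy.
    """
    k = len(means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # cumulative reward per arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # pull each arm once to initialise the estimates
        else:
            # empirical mean plus an exploration bonus
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(bonus_scale * math.log(t) / counts[a]),
            )
        reward = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret


def regret_summary(means, horizon, n_runs, threshold, seed=0):
    """Estimate E[regret] and P(regret >= threshold) over independent runs."""
    rng = random.Random(seed)
    regrets = [run_ucb(means, horizon, rng) for _ in range(n_runs)]
    mean_regret = sum(regrets) / n_runs
    tail_prob = sum(r >= threshold for r in regrets) / n_runs
    return mean_regret, tail_prob


if __name__ == "__main__":
    # Instance from Figure 1; horizon, run count and threshold are arbitrary.
    mean_regret, tail_prob = regret_summary(
        means=(0.1, -0.1), horizon=2000, n_runs=200, threshold=50.0
    )
    print(f"estimated E[regret]       = {mean_regret:.2f}")
    print(f"estimated P(regret >= 50) = {tail_prob:.3f}")
```

Sweeping `threshold` for a fixed policy traces out an empirical tail curve, and repeating the sweep for different values of `bonus_scale` gives a rough view of how the bonus affects mean regret and tail mass. A few hundred runs resolve only moderately small tail probabilities, so very light tails need many more runs.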

Paper Structure (18 sections, 17 theorems, 173 equations, 2 figures, 1 table)

Introduction
Our Contributions
Related Work
Organization and Notation
The Setup
Regret Expectation and Tail Risk
Tail Lower Bound: The Best to Hope
Tail Upper Bound: The Best to Achieve
The Fixed-time Design
The Any-time Design
Generalization and Extensions
Robustness in Sub-Exponential Environments
Extension to Linear Bandits
Implications on Reinforcement Learning: AlphaGo
Conclusion
...and 3 more sections

Key Result

Lemma 1

We have $\mathbb E[N^\pi(T)] = 0$ and

Figures (2)

Figure 1: regret expectation vs. tail risk for $(\mathcal{N}(0.1, 1), \mathcal{N}(-0.1, 1))$
Figure 2: MCTS procedure of one simulation in AlphaGo

Theorems & Definitions (17)

Lemma 1
Theorem 1
Lemma 2
Proposition 1
Corollary 1
Theorem 2
Corollary 2
Proposition 2
Theorem 3
Corollary 3
...and 7 more
