Best Arm Identification with Minimal Regret

Junwen Yang; Vincent Y. F. Tan; Tianyuan Jin


TL;DR

Focusing on single-parameter exponential families of distributions, this work designs and analyzes the Double KL-UCB algorithm, which achieves asymptotic optimality as the confidence level tends to zero, and elucidates a fresh perspective on the inherent connections between regret minimization and best arm identification (BAI).

Abstract

Motivated by real-world applications that necessitate responsible experimentation, we introduce the problem of best arm identification (BAI) with minimal regret. This innovative variant of the multi-armed bandit problem elegantly amalgamates two of its most ubiquitous objectives: regret minimization and BAI. More precisely, the agent's goal is to identify the best arm with a prescribed confidence level $δ$, while minimizing the cumulative regret up to the stopping time. Focusing on single-parameter exponential families of distributions, we leverage information-theoretic techniques to establish an instance-dependent lower bound on the expected cumulative regret. Moreover, we present an intriguing impossibility result that underscores the tension between cumulative regret and sample complexity in fixed-confidence BAI. Complementarily, we design and analyze the Double KL-UCB algorithm, which achieves asymptotic optimality as the confidence level tends to zero. Notably, this algorithm employs two distinct confidence bounds to guide arm selection in a randomized manner. Our findings elucidate a fresh perspective on the inherent connections between regret minimization and BAI.
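To make the two quantities in the abstract concrete, the following is a minimal sketch for Bernoulli arms, which are one example of a single-parameter exponential family. It shows a generic KL-based upper confidence bound computed by bisection and the cumulative regret incurred by a given allocation of pulls. It is not the paper's Double KL-UCB algorithm: the function names, the `level` argument (the exploration threshold), and the Bernoulli assumption are all illustrative choices, and the two specific confidence levels and the randomized selection rule used by Double KL-UCB are not reproduced here.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence KL(Ber(p) || Ber(q)), with inputs clipped away from 0 and 1."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_upper_bound(mean: float, pulls: int, level: float, iters: int = 50) -> float:
    """Approximate the largest q in [mean, 1] with pulls * KL(mean, q) <= level.

    `level` is a hypothetical exploration threshold; an algorithm that keeps
    two confidence bounds per arm could call this twice with different levels.
    """
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if pulls * bernoulli_kl(mean, mid) <= level:
            lo = mid  # mid is still inside the confidence region
        else:
            hi = mid  # mid is outside; shrink from above
    return lo


def cumulative_regret(means, pull_counts):
    """Regret of an allocation: sum over arms of (best mean - arm mean) * pulls."""
    best = max(means)
    return sum((best - m) * n for m, n in zip(means, pull_counts))
```

For instance, `kl_upper_bound(0.5, 10, math.log(1 / 0.05))` would give a bound tied to a confidence parameter of 0.05, while a smaller `level` gives a tighter bound; how such levels should be chosen is exactly what the paper's analysis addresses.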


Paper Structure (27 sections, 15 theorems, 138 equations, 1 algorithm)


Introduction
Main contributions.
Related work.
Problem Setup and Preliminaries
Multi-armed bandits.
Best arm identification with minimal regret.
Exponential families.
Other notations.
Lower Bound
The Double KL-UCB Algorithm
Theoretical Analysis of DKL-UCB
Main Results
Technical Challenges and Proof Outline
Discussion: Comparisons to Related Problems
Cumulative regret minimization.
...and 12 more sections

Key Result

Theorem 3

For a fixed confidence level $\delta\in (0,1)$ and instance $\bm{\mu} \in \mathcal{M}$, any $\delta$-PAC BAI algorithm satisfies [equation not recovered], where [definition not recovered]. Furthermore, [equation not recovered].

Theorems & Definitions (20)

Definition 1
Remark 2
Theorem 3: Information-theoretic lower bound
Definition 4: Asymptotic optimality
Theorem 5: Impossibility result
Remark 6
Theorem 7
Example 1
Lemma 8 (menard2017minimax)
Lemma 9: Maximal Inequality (menard2017minimax)
...and 10 more
