Identifying the Best Transition Law

Mehrasa Ahmadipour, Élise Crepon, Aurélien Garivier

TL;DR

This work tackles best-arm identification in bandits where each arm's reward follows a multinomial distribution with a known support, comparing a non-structured LUCB baseline against structured variants that exploit the known support. The authors develop two structured approaches: Structured-LUCB, which uses per-component confidence bounds on the probability vector and aggregates them to bounds on the expected reward, and EL-LUCB, which employs Empirical Likelihood via KL-based confidence regions on the joint probability vector. Through simulations across scenarios with varying structural complexity, they show that exploiting structure can yield substantial gains in sample efficiency in some regimes (notably when outcomes concentrate on specific support elements), while in other regimes the non-structured method may perform better. The study also analyzes how the choice of confidence mechanisms (Hoeffding vs Bernstein) and the complexity of the support impact stopping times and computational load, highlighting a trade-off between accuracy, efficiency, and tractability in structured BAI. These insights inform when to leverage known distributional structure in practical decision-making under uncertainty.
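To make the per-component idea concrete, here is a minimal sketch of how independent Hoeffding intervals on each probability could be aggregated into bounds on an arm's expected reward. It is an illustration under stated assumptions, not the authors' implementation: the function names, the union bound over the `k` components, and the greedy aggregation over the probability simplex are all choices made for this example.

```python
import math


def hoeffding_radius(n, k, delta):
    # Two-sided Hoeffding radius with a union bound over k components
    # (the union bound is an assumption of this sketch).
    return math.sqrt(math.log(2 * k / delta) / (2 * n))


def structured_bounds(counts, support, delta):
    """Return (lower, upper) bounds on the mean reward of one arm.

    counts[i] is how often outcome support[i] was observed.
    """
    n = sum(counts)
    k = len(counts)
    eps = hoeffding_radius(n, k, delta)
    p_hat = [c / n for c in counts]
    lo = [max(0.0, p - eps) for p in p_hat]
    hi = [min(1.0, p + eps) for p in p_hat]

    def extreme(order):
        # Start from the lower bounds, then hand out the remaining
        # probability mass to outcomes in the given priority order.
        q = lo[:]
        slack = 1.0 - sum(lo)
        for i in order:
            add = min(hi[i] - lo[i], slack)
            q[i] += add
            slack -= add
        return sum(qi * vi for qi, vi in zip(q, support))

    by_value_desc = sorted(range(k), key=lambda i: support[i], reverse=True)
    upper = extreme(by_value_desc)        # favor large outcomes
    lower = extreme(by_value_desc[::-1])  # favor small outcomes
    return lower, upper


lower, upper = structured_bounds([50, 30, 20], [0.0, 1.0, 2.0], delta=0.05)
print(lower, upper)
```

The greedy step solves the small linear program "maximize (or minimize) the mean over the box of per-component intervals intersected with the simplex", which is one plausible way to turn component-wise bounds into a reward bound.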

Abstract

Motivated by recursive learning in Markov Decision Processes, this paper studies best-arm identification in bandit problems where each arm's reward is drawn from a multinomial distribution with a known support. We compare the performance reached by strategies, notably LUCB, without and with use of this knowledge. In the first case, we use classical non-parametric approaches for the confidence intervals. In the second case, where a probability distribution is to be estimated, we first use classical deviation bounds (Hoeffding and Bernstein) on each dimension independently, and then the Empirical Likelihood method (EL-LUCB) on the joint probability vector. The effectiveness of these methods is demonstrated through simulations on scenarios with varying levels of structural complexity.
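As a rough illustration of an Empirical Likelihood bound on the joint probability vector, the sketch below computes an upper confidence bound on the mean by maximizing $\sum_i q_i v_i$ over all $q$ with $n\,\mathrm{KL}(\hat p \,\|\, q) \le$ `threshold`. Several things here are assumptions rather than details from the paper: the function name, the `threshold` parameter (the paper's exploration rate is not given here), and the restriction to arms for which every outcome has been observed at least once. The maximizer has the form $q_i \propto \hat p_i / (\lambda - v_i)$ for some $\lambda > \max_i v_i$, so a one-dimensional bisection on $\lambda$ suffices.

```python
import math


def el_upper_bound(counts, support, threshold, iters=100):
    """Upper bound on the mean over {q : n * KL(p_hat || q) <= threshold}."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if any(c <= 0 for c in counts):
        raise ValueError("this sketch assumes every outcome was observed")
    n = sum(counts)
    p_hat = [c / n for c in counts]
    v_max, v_min = max(support), min(support)
    if v_min == v_max:
        return v_max  # all outcomes pay the same reward
    level = threshold / n

    def tilt(lam):
        # Candidate maximizer q_i proportional to p_hat_i / (lam - v_i).
        w = [p / (lam - v) for p, v in zip(p_hat, support)]
        z = sum(w)
        return [wi / z for wi in w]

    def kl(q):
        return sum(p * math.log(p / qi) for p, qi in zip(p_hat, q))

    # KL(p_hat || tilt(lam)) decreases in lam: find a feasible upper end,
    # then bisect toward the smallest feasible lam.
    low, high = v_max, v_max + (v_max - v_min)
    while kl(tilt(high)) > level:
        high = v_max + 2 * (high - v_max)
    for _ in range(iters):
        mid = 0.5 * (low + high)
        if mid <= low or mid >= high:
            break  # interval exhausted at floating-point resolution
        if kl(tilt(mid)) > level:
            low = mid
        else:
            high = mid
    q = tilt(high)
    return sum(qi * vi for qi, vi in zip(q, support))


beta = math.log(1 / 0.05)  # hypothetical threshold, for illustration only
print(el_upper_bound([50, 30, 20], [0.0, 1.0, 2.0], beta))
```

A lower bound can be obtained the same way by applying the function to the negated support and negating the result. Handling outcomes that have never been observed requires extra care and is left out of this sketch.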

Paper Structure

This paper contains 12 sections, 19 equations, 5 figures, 1 algorithm.

Figures (5)

  • Figure 1: A learner at $s$ selects a color and transitions to $S^{\prime}$. Each color represents a probability vector, meaning that the likelihood of arriving at a destination varies depending on the chosen color.
  • Figure 2: Comparing the stopping times of two algorithms on $V^{\text{test1}}$.
  • Figure 3: Comparing the stopping times of two algorithms on $V^{\text{test2}}$.
  • Figure 4: Comparing the stopping times of the EL-LUCB algorithm on $V^{\text{test1}}$ (above) and $V^{\text{test2}}$ (below).
  • Figure 5: Comparing the stopping times of the EL-LUCB algorithm on $V^{\text{test3}}$ with low range (above) and on $V^{\text{test4}}$ with high range (below).

Theorems & Definitions (2)

  • Definition 1
  • Definition 2: Sample Complexity