Differential Good Arm Identification

Yun-Da Tsai, Tzu-Hsien Tsai, Shou-De Lin

TL;DR

This work addresses good arm identification (GAI) in stochastic bandits, aiming to identify many arms with mean rewards above a threshold $\xi$ using as few samples as possible. It introduces DGAI, a differentiable algorithm that learns adaptive confidence bounds via a differentiable UCB index, with separate training objectives for sampling and identification, and proves a $\delta$-PAC guarantee for the linear case. DGAI outperforms state-of-the-art baselines (e.g., HDoC, LUCB-G, APT-G) on synthetic and real-world datasets for GAI and can enhance cumulative reward maximization in MAB problems when a threshold is provided as prior knowledge. The approach delivers data-driven, problem-adaptive confidence bounds, leading to substantial improvements in sample efficiency and decision quality with potential extensions to non-linear settings.

Abstract

This paper targets a variant of the stochastic multi-armed bandit problem called good arm identification (GAI). GAI is a pure-exploration bandit problem whose goal is to output as many good arms as possible using as few samples as possible, where a good arm is defined as an arm whose expected reward is greater than a given threshold. In this work, we propose DGAI, a differentiable good arm identification algorithm, to improve the sample complexity of the state-of-the-art HDoC algorithm in a data-driven fashion. We also show that DGAI can further boost performance on a general multi-armed bandit (MAB) problem when a threshold is given as prior knowledge about the arm set. Extensive experiments confirm that our algorithm significantly outperforms the baseline algorithms on both synthetic and real-world datasets for both GAI and MAB tasks.
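
To make the GAI setting concrete, the following is a minimal sketch of a threshold-based identification loop. It is not DGAI and is not taken from the paper: the UCB sampling index, the Hoeffding-style identification radius, the Bernoulli reward model, and all names (`identify_good_arms`, `means`, `xi`, `delta`, `max_rounds`) are assumptions chosen for illustration, loosely in the spirit of fixed-bound baselines such as HDoC.

```python
import math
import random


def identify_good_arms(means, xi, delta=0.05, max_rounds=100_000, seed=0):
    """Illustrative GAI loop on simulated Bernoulli arms (assumed setup).

    means: true arm means, used only to simulate rewards.
    xi:    threshold; an arm is "good" if its mean exceeds xi.
    Returns (good, bad, t): arms declared good, arms declared bad,
    and the total number of samples drawn.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    active = set(range(k))
    good, bad = [], []

    def pull(i):
        # Simulated Bernoulli reward for arm i.
        reward = 1.0 if rng.random() < means[i] else 0.0
        counts[i] += 1
        sums[i] += reward

    # Pull every arm once so the empirical means are defined.
    for i in range(k):
        pull(i)
    t = k

    while active and t < max_rounds:
        # Sampling rule (assumption): pick the active arm with the largest
        # UCB index, empirical mean plus sqrt(log t / (2 n)).
        i = max(
            active,
            key=lambda a: sums[a] / counts[a]
            + math.sqrt(math.log(t) / (2 * counts[a])),
        )
        pull(i)
        t += 1

        # Identification rule (assumption): a fixed Hoeffding-style radius
        # with a union bound over arms and sample counts.
        n = counts[i]
        mean = sums[i] / n
        radius = math.sqrt(math.log(4 * k * n * n / delta) / (2 * n))
        if mean - radius >= xi:
            good.append(i)
            active.remove(i)
        elif mean + radius <= xi:
            bad.append(i)
            active.remove(i)

    return good, bad, t
```

Calling `identify_good_arms([0.9, 0.8, 0.2, 0.1], xi=0.5)` runs the loop on four simulated arms. The radius here is fixed in advance, whereas the paper's stated contribution is to learn the confidence bound from data.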

Paper Structure (31 sections, 6 theorems, 23 equations, 4 figures, 3 tables, 2 algorithms)

Key Result

Theorem 1

Under any $(\lambda, \delta)$-PAC algorithm, if there are $m \ge \lambda$ good arms, then [displayed inequality missing from the source], where $d(x,y) = x\log(x/y)+(1-x)\log((1-x)/(1-y))$ is the binary relative entropy, with the convention that $d(0,0)=d(1,1)=0$.
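
The binary relative entropy $d(x,y)$ used in Theorem 1 can be computed directly. The sketch below follows the stated formula and the convention $d(0,0)=d(1,1)=0$. The function name `binary_relative_entropy`, the treatment of $0\log 0$ as $0$, and the return value of infinity when $y$ lies on the boundary but $x$ does not are assumptions added here for numerical completeness.

```python
import math


def binary_relative_entropy(x: float, y: float) -> float:
    """d(x, y) = x log(x/y) + (1 - x) log((1 - x)/(1 - y)) for x, y in [0, 1]."""
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("x and y must lie in [0, 1]")

    def term(p: float, q: float) -> float:
        if p == 0.0:
            # 0 * log(0 / q) is taken to be 0, which yields d(0,0) = d(1,1) = 0.
            return 0.0
        if q == 0.0:
            # Assumption: positive mass against zero mass gives +infinity.
            return math.inf
        return p * math.log(p / q)

    return term(x, y) + term(1.0 - x, 1.0 - y)
```

For example, `binary_relative_entropy(mu, xi)` measures how distinguishable an arm with mean `mu` is from the threshold `xi`; it is zero when the two coincide and grows as they separate.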

Figures (4)

  • Figure 1: Comparison of our proposed method with several baselines on all datasets. The cumulative exploit score shows that our proposed method outperforms the baselines on the GAI problem as the learned confidence bound w.r.t. $\alpha,\beta$ converges. Performance in the online setting converges more slowly but eventually also outperforms the other baselines.
  • Figure 2: Learning curves of $\alpha,\beta$. Solid lines represent the offline setting and dashed lines represent the online setting. The horizontal axis is training epochs in the offline setting and sampling rounds $t\in[T]$ in the online setting. The curves show that the parameters converge as training proceeds, and that they converge more slowly in the online setting.
  • Figure 3: Confidence bound comparison. This figure plots the identification bound for the best arm in the arm set $\mathcal{A}$ along the training trajectory and compares our method with the baselines.
  • Figure 4: Comparison of our proposed method with several baselines on all datasets. The cumulative reward shows that our proposed method outperforms the baselines on the cumulative reward maximization problem as the learned confidence bound w.r.t. $\alpha,\beta$ converges to its optimal value.

Theorems & Definitions (10)

  • Definition 1: $(\lambda, \delta)$-PAC
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Lemma 1
  • Lemma 2
  • Theorem 3
  • Proof: Proof of Theorem \ref{thm:PAC}
  • Remark 1
  • Remark 2