Optimal Best Arm Identification with Post-Action Context

Mohammad Shahverdikondori, Amir Mohammad Abouei, Alireza Rezaeimoghadam, Negar Kiyavash

TL;DR

This work studies best arm identification (BAI) with post-action context in fixed-confidence stochastic multi-armed bandits, distinguishing a non-separator setting from a separator setting. It derives instance-dependent lower bounds on the sample complexity and develops asymptotically optimal algorithms for both: NSTS for the non-separator setting, which combines D-tracking with a GLR-based stopping rule, and STS for the separator setting, which uses G-tracking to guide sampling through the geometry of the context space rather than through the actions. Both algorithms match the lower bounds up to vanishing terms, and experiments show that they outperform traditional Track-and-Stop baselines. The results indicate that post-action feedback can reduce the sample complexity of BAI in domains where an intermediate context informs the reward.

Abstract

We introduce the problem of best arm identification (BAI) with post-action context, a new BAI problem in a stochastic multi-armed bandit environment and the fixed-confidence setting. The problem addresses the scenarios in which the learner receives a $\textit{post-action context}$ in addition to the reward after playing each action. This post-action context provides additional information that can significantly facilitate the decision process. We analyze two different types of the post-action context: (i) $\textit{non-separator}$, where the reward depends on both the action and the context, and (ii) $\textit{separator}$, where the reward depends solely on the context. For both cases, we derive instance-dependent lower bounds on the sample complexity and propose algorithms that asymptotically achieve the optimal sample complexity. For the non-separator setting, we do so by demonstrating that the Track-and-Stop algorithm can be extended to this setting. For the separator setting, we propose a novel sampling rule called $\textit{G-tracking}$, which uses the geometry of the context space to directly track the contexts rather than the actions. Finally, our empirical results showcase the advantage of our approaches compared to the state of the art.
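The difference between the two context types can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the paper: the matrix `A` (context distribution per action), the mean-reward tables `mu_sep` and `mu_nonsep`, the Gaussian noise, and all function names are assumptions chosen for the example.

```python
import random

rng = random.Random(0)

# Hypothetical instance: 3 actions, 2 contexts.
# Row a of A is the assumed distribution of the post-action context after playing action a.
A = [[0.9, 0.1],
     [0.5, 0.5],
     [0.2, 0.8]]

# Separator: the mean reward depends on the context only.
mu_sep = [1.0, 0.0]                      # indexed by context
# Non-separator: the mean reward depends on the (action, context) pair.
mu_nonsep = [[1.0, 0.0],
             [0.9, 1.0],
             [0.0, 0.5]]                 # indexed by [action][context]

def pull(action, separator, noise_std=1.0):
    """Play an action and return the observed (context, reward) pair."""
    context = rng.choices(range(len(A[action])), weights=A[action])[0]
    mean = mu_sep[context] if separator else mu_nonsep[action][context]
    return context, rng.gauss(mean, noise_std)   # Gaussian noise is an assumption

def expected_rewards(separator):
    """Expected reward of each action, averaged over its context distribution."""
    if separator:
        return [sum(p * mu_sep[z] for z, p in enumerate(row)) for row in A]
    return [sum(p * mu_nonsep[a][z] for z, p in enumerate(row))
            for a, row in enumerate(A)]

def best_action(separator):
    """Index of the action with the largest expected reward."""
    values = expected_rewards(separator)
    return max(range(len(values)), key=values.__getitem__)
```

In the separator case every observed reward is informative about a context mean shared by all actions, which is the intuition behind tracking contexts instead of actions. In the non-separator case each (action, context) pair has its own mean, so samples from one action say nothing about the rewards of another.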

Paper Structure

This paper contains 39 sections, 23 theorems, 159 equations, 7 figures, 2 tables, 2 algorithms.

Key Result

Proposition 3.1

For any bandit environment with parameter $\mathcal{A}$ and $\boldsymbol{\mu} \in \mathcal{I}(\mathcal{A})$ and any $\delta$-correct algorithm, $\mathbb{E}_{\boldsymbol{\mu}, \mathcal{A}}[\tau_{\delta}] \geq T^*(\boldsymbol{\mu}, \mathcal{A})d_B(\delta, 1- \delta)$, where $T^*(\boldsymbol{\mu}, \mathcal{A})$ is
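To get a feel for how the bound in Proposition 3.1 scales with $\delta$, the sketch below evaluates its right-hand side numerically. It assumes that $d_B$ denotes the binary relative entropy between Bernoulli distributions, as is common in the fixed-confidence literature; this page does not define it. The value of $T^*$ is passed in as a plain number because its definition is cut off above, and the function names are hypothetical.

```python
import math

def binary_kl(p, q):
    """Relative entropy between Bernoulli(p) and Bernoulli(q), in nats.

    Assumes 0 <= p <= 1 and 0 < q < 1.
    """
    def term(a, b):
        # Convention: 0 * log(0 / b) = 0.
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

def sample_complexity_lower_bound(t_star, delta):
    """Evaluate T* * d_B(delta, 1 - delta) for a user-supplied T*."""
    return t_star * binary_kl(delta, 1.0 - delta)
```

Under this assumption $d_B(\delta, 1-\delta) = (1 - 2\delta)\log\frac{1-\delta}{\delta}$, which grows like $\log(1/\delta)$ as $\delta \to 0$, so the bound grows logarithmically in the inverse confidence level.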

Figures (7)

  • Figure 1: Two possible structures for the post-action context.
  • Figure 2: Illustration of G-tracking rule for an instance with $k=3, n=5$. The triangle depicts the two-dimensional simplex, and the green area shows the policy space $\text{conv}(\mathcal{A})$.
  • Figure 3: The results of different algorithms for the instance in Equation \ref{eq: sep-instance}.
  • Figure 4: Illustration of the relative positions of points in the proof of Lemma \ref{lem: G-tracking}, where all points lie in $\Delta^{k-1}$.
  • Figure 5: Comparison of the $L^2$ distance of the frequencies of pulled arms and the optimal frequency over time between two algorithms.
  • ...and 2 more figures

Theorems & Definitions (36)

  • Definition 2.1: $\delta$-correct
  • Proposition 3.1
  • Theorem 4.1: Non-Separator Lower Bound
  • Lemma 4.1
  • Theorem 4.2: Non-Separator Upper Bound
  • Theorem 5.1: Separator Lower Bound
  • Lemma 5.1
  • Theorem 5.2: Separator Upper Bound
  • Definition 4.1
  • Lemma 4.1
  • ...and 26 more