
Pure Exploration under Mediators' Feedback

Riccardo Poiani, Alberto Maria Metelli, Marcello Restelli

TL;DR

The paper introduces best-arm identification under mediators' feedback (BAI-MF), a generalization of fixed-confidence BAI in which mediators with known or unknown policies select arms on the agent's behalf. It derives a mediator-aware information-theoretic lower bound on the sample complexity and proposes a Track-and-Stop style algorithm that tracks mediator proportions and achieves this bound when the policies are known, with extensions to unknown mediator policies. The theoretical results show asymptotic optimality (both almost surely and in expectation) under the action-covering assumption, and the approach reduces to classical BAI when mediators provide direct arm pulls. Empirically and theoretically, the work links mediator-based feedback to off-policy learning, offering a principled framework for efficient pure exploration in partially controllable or human-in-the-loop settings.

Abstract

Stochastic multi-armed bandits are a sequential-decision-making framework, where, at each interaction step, the learner selects an arm and observes a stochastic reward. Within the context of best-arm identification (BAI) problems, the goal of the agent lies in finding the optimal arm, i.e., the one with highest expected reward, as accurately and efficiently as possible. Nevertheless, the sequential interaction protocol of classical BAI problems, where the agent has complete control over the arm being pulled at each round, does not effectively model several decision-making problems of interest (e.g., off-policy learning, partially controllable environments, and human feedback). For this reason, in this work, we propose a novel strict generalization of the classical BAI problem that we refer to as best-arm identification under mediators' feedback (BAI-MF). More specifically, we consider the scenario in which the learner has access to a set of mediators, each of which selects the arms on the agent's behalf according to a stochastic and possibly unknown policy. The mediator, then, communicates back to the agent the pulled arm together with the observed reward. In this setting, the agent's goal lies in sequentially choosing which mediator to query to identify with high probability the optimal arm while minimizing the identification time, i.e., the sample complexity. To this end, we first derive and analyze a statistical lower bound on the sample complexity specific to our general mediator feedback scenario. Then, we propose a sequential decision-making strategy for discovering the best arm under the assumption that the mediators' policies are known to the learner. As our theory verifies, this algorithm matches the lower bound both almost surely and in expectation. Finally, we extend these results to cases where the mediators' policies are unknown to the learner, obtaining comparable results.
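
The interaction protocol described in the abstract can be sketched in a few lines of Python. This is only an illustrative simulation, not the paper's algorithm: the function names, the unit-variance Gaussian rewards, and the round-robin choice of mediator are all assumptions made for the example. The helper `induced_arm_proportions` shows how a distribution over mediators translates into a distribution over arms by averaging the mediators' policies.

```python
import random


def query_mediator(policy, arm_means, rng):
    """One round: the mediator samples an arm from its policy, pulls it,
    and reports back the pulled arm together with the observed reward."""
    arm = rng.choices(range(len(policy)), weights=policy, k=1)[0]
    # Assumption for this sketch: Gaussian rewards with unit variance.
    reward = rng.gauss(arm_means[arm], 1.0)
    return arm, reward


def induced_arm_proportions(omega, policies):
    """Arm proportions induced by querying mediator e with probability omega[e]:
    sum over e of omega[e] * policies[e][a], for every arm a."""
    n_arms = len(policies[0])
    return [sum(w * pi[a] for w, pi in zip(omega, policies)) for a in range(n_arms)]


def run_round_robin(policies, arm_means, n_rounds, seed=0):
    """Query mediators in round-robin order (a placeholder strategy, not the
    paper's) and return per-arm pull counts and empirical mean rewards."""
    rng = random.Random(seed)
    counts = [0] * len(arm_means)
    sums = [0.0] * len(arm_means)
    for t in range(n_rounds):
        mediator = t % len(policies)
        arm, reward = query_mediator(policies[mediator], arm_means, rng)
        counts[arm] += 1
        sums[arm] += reward
    means = [s / c if c else float("nan") for s, c in zip(sums, counts)]
    return counts, means
```

When every mediator deterministically pulls a single, distinct arm (for instance `policies = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]`), choosing a mediator is the same as choosing an arm, which is the sense in which the setting reduces to classical BAI.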

Paper Structure

This paper contains 15 sections, 24 theorems, 80 equations, 2 tables.

Key Result

Theorem 1

Let $\delta \in (0,1)$. For any $\delta$-correct strategy, any bandit model $\boldsymbol{\mu}$, and any set of mediators $\boldsymbol{\pi}$ it holds that $\mathbb{E}_{\boldsymbol{\mu}, \boldsymbol{\pi}} \left[ \tau_\delta \right] \ge \textup{kl}(\delta, 1-\delta) T^*(\boldsymbol{\mu}, \boldsymbol{\pi})$ ...
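
To get a feel for the confidence-dependent factor in this bound, the sketch below evaluates $\textup{kl}(\delta, 1-\delta)$ numerically. It assumes that $\textup{kl}$ denotes the binary relative entropy (the KL divergence between two Bernoulli distributions), which is the usual convention in fixed-confidence BAI lower bounds but is not spelled out in this excerpt. The function names are illustrative.

```python
import math


def binary_kl(x, y):
    """KL divergence between Bernoulli(x) and Bernoulli(y), for x, y in (0, 1)."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))


def confidence_factor(delta):
    """The factor kl(delta, 1 - delta) multiplying the characteristic time.
    Algebraically this equals (1 - 2 * delta) * log((1 - delta) / delta)."""
    return binary_kl(delta, 1 - delta)


if __name__ == "__main__":
    for delta in (0.1, 0.01, 0.001):
        print(delta, confidence_factor(delta))
```

Under this reading the factor vanishes at $\delta = 1/2$ and grows like $\log(1/\delta)$ as $\delta \to 0$, so a stricter confidence requirement raises the lower bound only logarithmically.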

Theorems & Definitions (24)

  • Theorem 1
  • Proposition 1
  • Theorem 2
  • Theorem 3
  • Proposition 3
  • Theorem 4
  • Theorem 5
  • Theorem 5
  • Proposition 5
  • Proposition 5
  • ...and 14 more