Bayesian Decision Making around Experts

Daniel Jarne Ornia, Joel Dyer, Nicholas Bishop, Anisoara Calinescu, Michael Wooldridge

TL;DR

This work addresses Bayesian online learning in multi-armed bandits when an expert coexists with the learner, analyzing two settings: offline expert data and simultaneous, online expert observations. It introduces an information-theoretic framework that yields a consistent Bayesian update with expert data, tightens regret bounds via mutual information with the optimal action, and presents an information-directed rule for choosing between self and expert data sources. The authors also model imperfect or adversarial experts, proposing a trust-learning mechanism that estimates expert reliability and adapts data processing accordingly. Empirical results demonstrate substantial regret reductions in asymmetric worlds and illustrate when to rely on or discount expert information. The framework provides practical, principled guidance for robust, data-source-aware Bayesian learning in multi-agent environments.

Abstract

Complex learning agents are increasingly deployed alongside existing experts, such as human operators or previously trained agents. However, it remains unclear how learners should optimally incorporate certain forms of expert data, which may differ in structure from the learner's own action-outcome experiences. We study this problem in the context of Bayesian multi-armed bandits, considering: (i) offline settings, where the learner receives a dataset of outcomes from the expert's optimal policy before interaction, and (ii) simultaneous settings, where the learner must choose at each step whether to update its beliefs based on its own experience or on the outcome simultaneously achieved by an expert. We formalize how expert data influences the learner's posterior, and prove that pretraining on expert outcomes tightens information-theoretic regret bounds by the mutual information between the expert data and the optimal action. For the simultaneous setting, we propose an information-directed rule where the learner processes the data source that maximizes its one-step information gain about the optimal action. Finally, we propose strategies for how the learner can infer when to trust the expert and when not to, safeguarding the learner against cases where the expert is ineffective or compromised. By quantifying the value of expert data, our framework provides practical, information-theoretic algorithms for agents to intelligently decide when to learn from others.
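To make the information-directed rule for the simultaneous setting more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a finite set of candidate worlds with Bernoulli rewards, and it assumes the expert's outcome is drawn from the best arm of the true world without revealing which arm was played. The names `worlds`, `belief`, `info_gain`, and `choose_source` are hypothetical and introduced only for illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def optimal_action_dist(belief, worlds):
    """Distribution of the optimal arm A* under a belief over worlds."""
    n_arms = worlds.shape[1]
    best = worlds.argmax(axis=1)  # optimal arm in each candidate world
    return np.bincount(best, weights=belief, minlength=n_arms)

def info_gain(belief, worlds, success_prob):
    """One-step mutual information I(A*; Y) for a Bernoulli outcome Y,
    where success_prob[i] = P(Y = 1 | world i)."""
    h_prior = entropy(optimal_action_dist(belief, worlds))
    h_post = 0.0
    for lik in (success_prob, 1.0 - success_prob):  # Y = 1, then Y = 0
        p_y = float(belief @ lik)
        if p_y > 0:
            posterior = belief * lik / p_y  # Bayes update over worlds
            h_post += p_y * entropy(optimal_action_dist(posterior, worlds))
    return h_prior - h_post

def choose_source(belief, worlds, own_arm):
    """Pick the data source with the larger one-step information gain."""
    own = info_gain(belief, worlds, worlds[:, own_arm])
    # Assumption: the expert always plays the best arm of the true world.
    expert = info_gain(belief, worlds, worlds.max(axis=1))
    return ("expert" if expert > own else "self"), own, expert

# Two candidate worlds, two arms; rows are worlds, columns are arm means.
symmetric = np.array([[0.9, 0.1], [0.1, 0.9]])
asymmetric = np.array([[0.9, 0.1], [0.1, 0.6]])
belief = np.array([0.5, 0.5])

print(choose_source(belief, symmetric, own_arm=0))
print(choose_source(belief, asymmetric, own_arm=0))
```

Under these assumptions, in the symmetric example the expert's outcome has the same distribution in both worlds, so it carries no information about the optimal action and the rule falls back on the learner's own data. In the asymmetric example the expert's success rate differs across worlds, so its outcomes become informative.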

Paper Structure

This paper contains 51 sections, 4 theorems, 49 equations, 5 figures, 2 algorithms.

Key Result

Proposition 1

Assume a countable set $\Theta$. As the number of samples $N\to\infty$, the posterior update in (eq:posteriorfinite) converges to the infinite-data update in (eq:posterior_inf). In other words,

Figures (5)

  • Figure 1: Regret obtained by TS agents with expert data in symmetric bandits. Left: Pretraining with expert samples. Right: Selecting information sources.
  • Figure 2: Regret obtained by TS agents with expert data in asymmetric bandits. Left: Pretraining with expert samples. Right: Selecting information sources.
  • Figure 3: Regret obtained by TS agents with expert data in strongly asymmetric bandits. Left: Pretraining with expert samples. Right: Selecting information sources.
  • Figure 4: Regret obtained by MI estimating agents in symmetric bandits (right) and asymmetric bandits (left).
  • Figure 5: Regret obtained by TS agents with mistaken or adversarial expert data.

Theorems & Definitions (9)

  • Proposition 1
  • Theorem 1: Regret Reduction from Offline Expert Data
  • Remark 1
  • Proposition 2
  • Corollary 1: Symmetric Worlds
  • Proof: Proposition 1
  • Proof: Theorem 1 (Regret Reduction from Offline Expert Data)
  • Proof: Proposition 2
  • Proof: Corollary 1 (Symmetric Worlds)