Bayesian Decision Making around Experts
Daniel Jarne Ornia, Joel Dyer, Nicholas Bishop, Anisoara Calinescu, Michael Wooldridge
TL;DR
This work addresses Bayesian online learning in multi-armed bandits when an expert coexists with the learner, analyzing two settings: offline expert data and simultaneous, online expert observations. It introduces an information-theoretic framework that yields a consistent Bayesian update with expert data, tightens regret bounds by the mutual information between the expert data and the optimal action, and presents an information-directed rule for choosing between the learner's own data and the expert's data. The authors also model imperfect or adversarial experts, proposing a trust-learning mechanism that estimates expert reliability and adapts data processing accordingly. Empirical results demonstrate substantial regret reductions in asymmetric worlds and illustrate when to rely on or discount expert information. The framework provides practical, principled guidance for robust, data-source-aware Bayesian learning in multi-agent environments.
Abstract
Complex learning agents are increasingly deployed alongside existing experts, such as human operators or previously trained agents. However, it remains unclear how learners should optimally incorporate certain forms of expert data, which may differ in structure from the learner's own action-outcome experiences. We study this problem in the context of Bayesian multi-armed bandits, considering: (i) offline settings, where the learner receives a dataset of outcomes from the expert's optimal policy before interaction, and (ii) simultaneous settings, where the learner must choose at each step whether to update its beliefs based on its own experience or on the outcome simultaneously achieved by an expert. We formalize how expert data influences the learner's posterior, and prove that pretraining on expert outcomes tightens information-theoretic regret bounds by the mutual information between the expert data and the optimal action. For the simultaneous setting, we propose an information-directed rule where the learner processes the data source that maximizes its one-step information gain about the optimal action. Finally, we propose strategies for how the learner can infer when to trust the expert and when not to, safeguarding the learner against cases where the expert is ineffective or compromised. By quantifying the value of expert data, our framework provides practical, information-theoretic algorithms for agents to intelligently decide when to learn from others.
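The information-directed rule for the simultaneous setting can be illustrated with a small sketch. This is not the authors' implementation; it is a toy instance under assumptions that do not come from the abstract: a finite set of candidate worlds (the hypothetical array `worlds`, one row of Bernoulli arm means per world), a learner that commits to one arm per step, and an expert outcome that is a Bernoulli draw from the best arm's mean, with the expert's action not revealed. The function names `info_gain` and `choose_source` are illustrative only.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def optimal_action_dist(belief, worlds):
    """Distribution of the optimal arm A* induced by a belief over worlds."""
    best = worlds.argmax(axis=1)  # best arm in each candidate world
    return np.bincount(best, weights=belief, minlength=worlds.shape[1])

def info_gain(belief, worlds, success_prob):
    """One-step mutual information I(A*; Y) for a binary outcome Y,
    where success_prob[w] = P(Y = 1 | world w)."""
    h_prior = entropy(optimal_action_dist(belief, worlds))
    h_post = 0.0
    for lik in (success_prob, 1.0 - success_prob):  # Y = 1, then Y = 0
        p_y = float(belief @ lik)
        if p_y > 0:
            posterior = belief * lik / p_y  # Bayes update over worlds
            h_post += p_y * entropy(optimal_action_dist(posterior, worlds))
    return h_prior - h_post

def choose_source(belief, worlds, arm):
    """Pick the data source with the larger one-step information gain.
    Assumption: the expert outcome is Bernoulli with the best arm's mean."""
    own = info_gain(belief, worlds, worlds[:, arm])
    expert = info_gain(belief, worlds, worlds.max(axis=1))
    return ("expert" if expert > own else "self"), own, expert

# Hypothetical two-world, two-arm example with a uniform prior.
worlds = np.array([[0.9, 0.5],
                   [0.5, 0.6]])
belief = np.array([0.5, 0.5])
for arm in (0, 1):
    source, own, expert = choose_source(belief, worlds, arm)
    print(f"arm {arm}: own={own:.4f} expert={expert:.4f} -> {source}")
```

In this toy example the learner prefers its own observation when it pulls arm 0, whose mean differs sharply between the two worlds, and prefers the expert outcome when it pulls arm 1, whose mean barely changes. In a symmetric pair of worlds such as `[[0.9, 0.1], [0.1, 0.9]]`, the expert outcome has the same distribution in both worlds and carries no information about the optimal action under these assumptions.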
