Collaborative Multi-Agent Heterogeneous Multi-Armed Bandits
Ronshee Chawla, Daniel Vial, Sanjay Shakkottai, R. Srikant
TL;DR
This work extends decentralized multi-armed bandit theory to a setting where $N$ agents each learn one of $M$ heterogeneous bandits in a fully distributed fashion. It introduces two collaboration regimes, context-unaware and partially context-aware, and develops GosInE-based algorithms in which agents propagate best-arm recommendations by gossiping over a network. The authors derive per-agent and group regret upper bounds, prove group regret lower bounds showing that the algorithms are near-optimal, and show that sharing best-arm information reduces regret, especially when agents know $r-1$ peers learning the same bandit. The results also give insight into how communication structure and local cooperation affect collective learning efficiency in decentralized systems.
Abstract
The study of collaborative multi-agent bandits has attracted significant attention recently. In light of this, we initiate the study of a new collaborative setting, consisting of $N$ agents such that each agent is learning one of $M$ stochastic multi-armed bandits to minimize their group cumulative regret. We develop decentralized algorithms which facilitate collaboration between the agents under two scenarios. We characterize the performance of these algorithms by deriving upper bounds on the per-agent cumulative regret and the group regret. We also prove lower bounds for the group regret in this setting, which demonstrate the near-optimal behavior of the proposed algorithms.
