Collaborative Multi-Agent Heterogeneous Multi-Armed Bandits

Ronshee Chawla; Daniel Vial; Sanjay Shakkottai; R. Srikant

TL;DR

This work extends decentralized multi-armed bandit theory to a setting where $N$ agents each learn one of $M$ heterogeneous bandits in a fully distributed fashion. It introduces two collaboration regimes (context unaware and partially context aware) and develops GosInE-based algorithms in which agents propagate their estimated best arms over a gossip network. The authors derive per-agent and group regret upper bounds, prove lower bounds on the group regret, and show that sharing best-arm information reduces regret, especially when each agent knows $r-1$ peers learning the same bandit. Together, the upper and lower bounds indicate near-optimal performance and show how communication structure and local cooperation affect collective learning efficiency in decentralized systems.

Abstract

The study of collaborative multi-agent bandits has attracted significant attention recently. In light of this, we initiate the study of a new collaborative setting consisting of $N$ agents, each learning one of $M$ stochastic multi-armed bandits, with the goal of minimizing their group cumulative regret. We develop decentralized algorithms that facilitate collaboration between the agents under two scenarios. We characterize the performance of these algorithms by deriving upper bounds on the per-agent cumulative regret and the group regret. We also prove lower bounds on the group regret in this setting, which demonstrate the near-optimal behavior of the proposed algorithms.
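To make the objective in the abstract concrete, the following is a minimal sketch of how group cumulative regret could be computed from a record of arm pulls. It assumes the group regret is the sum of the per-agent expected regrets, which is the usual convention but is not spelled out in the text above. The names `per_agent_regret`, `group_regret`, `bandits`, `assignment`, and `pulls_by_agent` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def per_agent_regret(arm_means: Sequence[float], pulls: Sequence[int]) -> float:
    """Expected cumulative regret of one agent on its own bandit.

    Each pull of arm k costs (best mean - mean of arm k).
    """
    best = max(arm_means)
    return sum(best - arm_means[k] for k in pulls)


def group_regret(
    bandits: Sequence[Sequence[float]],
    assignment: Sequence[int],
    pulls_by_agent: Sequence[Sequence[int]],
) -> float:
    """Sum of per-agent regrets (assumed definition of group regret).

    bandits[m]        -> arm means of bandit m
    assignment[i]     -> index of the bandit that agent i is learning
    pulls_by_agent[i] -> sequence of arm indices pulled by agent i
    """
    return sum(
        per_agent_regret(bandits[assignment[i]], pulls)
        for i, pulls in enumerate(pulls_by_agent)
    )


# Toy instance: M = 2 bandits with K = 3 arms, N = 4 agents, 3 rounds each.
bandits: List[List[float]] = [[0.9, 0.5, 0.1], [0.2, 0.8, 0.4]]
assignment = [0, 0, 1, 1]
pulls_by_agent = [[0, 0, 1], [2, 0, 0], [1, 1, 1], [0, 1, 2]]
total = group_regret(bandits, assignment, pulls_by_agent)
```

In this toy instance, agents 0 and 1 face a different best arm than agents 2 and 3, which is the heterogeneity the setting is about: a recommendation that is optimal for one agent can be suboptimal for another.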

Paper Structure (28 sections, 24 theorems, 79 equations, 2 figures, 4 algorithms)

Introduction
Key Contributions
Related Work
Problem Setup
Context Unaware Algorithm
Key Algorithmic Principles
Algorithm Description
Assumption on the Sticky Set
Regret Guarantee
Proof Sketch (Theorem \ref{thm:gosinereg})
Partially Context Aware Algorithm
Regret Guarantee
Proof Sketch (Theorem \ref{thm:gosinesideinforeg})
Lower Bounds
Numerical Results
...and 13 more sections

Key Result

Theorem 1

Consider a system of $N \geq 2$ agents connected by a complete graph (for each $i \in [N]$, $G(i,n) = (N-1)^{-1}$ for all $n \neq i$), each learning one of the $M \geq 2$ bandits with $K \geq 2$ arms, satisfying Assumption \ref{assume:stickyset}. Let the UCB parameter $\alpha > 10$ and the phase parameter $\b… where $\tau^{*} = 2\max\{2, \max_{m \in [M]}\tau_{m}^{*}\}$, $\tau_{m}^{*} = \inf \left\{j \in \mat…
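The complete-graph condition in Theorem 1, $G(i,n) = (N-1)^{-1}$ for all $n \neq i$, can be illustrated with a short sketch. It builds the gossip matrix and samples a gossip partner for a given agent. The zero diagonal (an agent never contacts itself) and the function names `complete_graph_gossip_matrix` and `sample_gossip_partner` are assumptions made for illustration and are not taken from the paper.

```python
import random
from typing import List


def complete_graph_gossip_matrix(n_agents: int) -> List[List[float]]:
    """Gossip matrix with G[i][n] = 1 / (N - 1) for n != i.

    The diagonal is set to 0 (assumption: no self-contact).
    """
    if n_agents < 2:
        raise ValueError("need at least 2 agents")
    p = 1.0 / (n_agents - 1)
    return [
        [0.0 if i == n else p for n in range(n_agents)]
        for i in range(n_agents)
    ]


def sample_gossip_partner(G: List[List[float]], i: int, rng: random.Random) -> int:
    """Draw the agent that agent i contacts, using row i of G as weights."""
    return rng.choices(range(len(G)), weights=G[i], k=1)[0]


G = complete_graph_gossip_matrix(5)
partner = sample_gossip_partner(G, 2, random.Random(0))
```

Each row of `G` is a probability distribution over the other $N-1$ agents, so on a complete graph every agent is equally likely to be contacted for a recommendation.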

Figures (2)

Figure 1: $(K, M, N, r)$ are $(20, 5, 25, 5)$ and $(30, 6, 36, 6)$ respectively. Arm means are in $[0, 1)$ and the UCB parameter $\alpha=15$.
Figure 2: $(K, M, N, r)$ are $(20, 5, 25, 5)$ and $(30, 6, 36, 6)$ respectively. Arm means are in $[2, 4)$ and the UCB parameter $\alpha=30$.
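The parameter tuples in the figure captions can be turned into problem instances along the following lines. This is a sketch under stated assumptions: arm means are drawn uniformly at random from the stated interval, and each bandit is learned by exactly $r$ agents (consistent with $N = M \cdot r$ in both captions, but not stated there). The function `make_instance` and its arguments are hypothetical, and the captions do not specify how the means were actually generated.

```python
import random
from typing import List, Tuple


def make_instance(
    K: int, M: int, N: int, r: int,
    low: float = 0.0, high: float = 1.0, seed: int = 0,
) -> Tuple[List[List[float]], List[int]]:
    """Build M bandits with K arms each and assign N agents to them.

    Assumptions: means are uniform on [low, high); agents are split into
    M groups of r agents, with group m learning bandit m.
    """
    if N != M * r:
        raise ValueError("this sketch assumes N == M * r")
    rng = random.Random(seed)
    bandits = [
        [low + (high - low) * rng.random() for _ in range(K)]
        for _ in range(M)
    ]
    assignment = [i // r for i in range(N)]  # agent i -> bandit index
    return bandits, assignment


# Parameter settings quoted in the captions of Figures 1 and 2.
fig1_small = make_instance(K=20, M=5, N=25, r=5, low=0.0, high=1.0)
fig2_large = make_instance(K=30, M=6, N=36, r=6, low=2.0, high=4.0)
```

The UCB parameter $\alpha$ from the captions is not used here, since it belongs to the learning algorithm rather than to the instance.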

Theorems & Definitions (38)

Theorem 1
Corollary 2
Corollary 3
Theorem 4
Corollary 5
Corollary 6
Theorem 7
Theorem 8
Proposition 1
proof
...and 28 more
