Individual Regret in Cooperative Stochastic Multi-Armed Bandits

Idan Barnea, Tal Lancewicki, Yishay Mansour

TL;DR

This paper analyzes cooperative stochastic MAB on graphs with $m$ agents and $A$ arms, introducing Coop-SE for decentralized, message-passing learning. It proves diameter-free per-agent regret bounds of ${O}(\mathcal{R}/m + A^2 + A\sqrt{\log T})$ (and minimax ${O}(\sqrt{TA\log T / m} + A^2 + A\sqrt{\log T})$), along with nearly matching lower bounds and extensions under CONGEST (logarithmic-size messages) and limited-round communication. The results show that cooperation divides the gap-dependent part of individual regret by $m$ without any dependence on graph diameter, and that the guarantee is preserved with logarithmic-size messages. The work also discusses random-action variants and outlines several directions for reducing communication further while preserving performance. Overall, the study gives diameter-free, near-optimal, communication-aware guarantees for distributed learning in stochastic MABs.
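For readers unfamiliar with the base algorithm, the following is a minimal sketch of standard single-agent Successive Elimination, the procedure that Coop-SE is described as a variant of. It is not the paper's Coop-SE: there is no graph, no message passing, and no sharing of samples. The function name `successive_elimination`, the Bernoulli reward model, and the confidence radius $\sqrt{2\log T / n}$ are assumptions chosen for illustration.

```python
import math
import random


def successive_elimination(means, T, seed=0):
    """Single-agent Successive Elimination on Bernoulli arms (illustrative only).

    means: true success probabilities, used only to simulate rewards.
    T:     time horizon (total number of pulls).
    Returns the list of arms still active at time T and the pull counts.
    """
    rng = random.Random(seed)
    num_arms = len(means)
    active = list(range(num_arms))
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    t = 0
    while t < T:
        # Round-robin: pull every active arm once (stop early at the horizon).
        for i in active:
            if t >= T:
                break
            reward = 1.0 if rng.random() < means[i] else 0.0
            counts[i] += 1
            sums[i] += reward
            t += 1
        # Eliminate arms whose upper confidence bound falls below the best
        # lower confidence bound among active arms.
        if len(active) > 1 and all(counts[i] > 0 for i in active):
            radius = {i: math.sqrt(2.0 * math.log(T) / counts[i]) for i in active}
            mean = {i: sums[i] / counts[i] for i in active}
            best_lcb = max(mean[i] - radius[i] for i in active)
            active = [i for i in active if mean[i] + radius[i] >= best_lcb]
    return active, counts
```

In this single-agent sketch, a suboptimal arm with gap $Δ_i$ is pulled on the order of $\log(T)/Δ_i^2$ times before it is eliminated. The cooperative setting studied in the paper aims to spread that exploration cost across the $m$ agents.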

Abstract

We study the regret in stochastic Multi-Armed Bandits (MAB) with multiple agents that communicate over an arbitrary connected communication graph. We analyze a variant of the Cooperative Successive Elimination algorithm, Coop-SE, and show an individual regret bound of ${O}(\mathcal{R} / m + A^2 + A \sqrt{\log T})$ and a nearly matching lower bound. Here $A$ is the number of actions, $T$ the time horizon, $m$ the number of agents, and $\mathcal{R} = \sum_{\Delta_i > 0}\log(T)/\Delta_i$ is the optimal single-agent regret, where $\Delta_i$ is the sub-optimality gap of action $i$. Our work is the first to show an individual regret bound in cooperative stochastic MAB that is independent of the graph's diameter. When considering communication networks, there are additional considerations beyond regret, such as message size and number of communication rounds. First, we show that our regret bound holds even if we restrict the messages to be of logarithmic size. Second, for a logarithmic number of communication rounds, we obtain a regret bound of ${O}(\mathcal{R} / m+A \log T)$.
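To get a feel for how the terms in the bound trade off, the sketch below evaluates $\mathcal{R}/m$, $A^2$, and $A\sqrt{\log T}$ for a given gap vector. The constants hidden by the $O(\cdot)$ notation are dropped, so the numbers indicate only relative scale, not actual regret. The function names `regret_bound_terms` and `bound` are hypothetical helpers introduced here.

```python
import math


def regret_bound_terms(gaps, T, m):
    """Terms of R/m + A^2 + A*sqrt(log T), with all constants dropped.

    gaps: sub-optimality gap of each action (0 for an optimal action).
    T:    time horizon.
    m:    number of agents.
    """
    num_actions = len(gaps)
    log_T = math.log(T)
    # R = sum over suboptimal actions of log(T) / gap (single-agent regret).
    single_agent_R = sum(log_T / g for g in gaps if g > 0)
    return {
        "single_agent_R": single_agent_R,
        "shared_term": single_agent_R / m,
        "A_squared": num_actions ** 2,
        "A_sqrt_log_T": num_actions * math.sqrt(log_T),
    }


def bound(gaps, T, m):
    """Sum of the three terms in the individual regret bound (no constants)."""
    terms = regret_bound_terms(gaps, T, m)
    return terms["shared_term"] + terms["A_squared"] + terms["A_sqrt_log_T"]
```

For example, `bound([0.0, 0.1, 0.2], 10**6, m)` decreases as `m` grows, but only the $\mathcal{R}/m$ term shrinks; the additive $A^2 + A\sqrt{\log T}$ part does not depend on the number of agents.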

Paper Structure

This paper contains 41 sections, 57 theorems, 148 equations, 1 table, 12 algorithms.

Key Result

Theorem 1

When each agent plays Coop-SE, the individual regret of each agent is ${O}(\mathcal{R} / m + A^2 + A \sqrt{\log T})$.

Theorems & Definitions (154)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • Proof sketch
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Remark 1
  • Proof of \ref{thm:diam}
  • Remark 2
  • ...and 144 more