Cooperative Bandit Learning in Directed Networks with Arm-Access Constraints

Evagoras Makridis, Themistoklis Charalambous

Abstract

Sequential decision-making under uncertainty often involves multiple agents learning which actions (arms) yield the highest rewards through repeated interaction with a stochastic environment. This setting is commonly modeled by cooperative multi-agent multi-armed bandit problems, where agents explore and share information without centralized coordination. In many realistic systems, agents have heterogeneous capabilities that limit their access to subsets of arms and communicate over asymmetric networks represented by directed graphs. In this work, we study multi-agent multi-armed bandit problems with partial arm access, where agents explore and exploit only the arms available to them while exchanging information with neighbors. We propose a distributed consensus-based upper confidence bound (UCB) algorithm that accounts for both the arm accessibility structure and network asymmetry. Our approach employs a mass-preserving information mixing mechanism, ensuring that reward estimates remain unbiased across the network despite accessibility constraints and asymmetric information flow. Under standard stochastic assumptions, we establish logarithmic regret for every agent, with explicit dependence on network mixing properties and arm accessibility constraints. These results quantify how heterogeneous arm access and directed communication shape cooperative learning performance.
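The mass-preserving mixing idea mentioned in the abstract can be illustrated with standard ratio consensus (push-sum) on a small directed graph. The sketch below is only an illustration under stated assumptions, not the paper's A2C-UCB recursion: the three-agent graph, the equal-split weight rule, and the helper names `column_stochastic_weights`, `mix`, and `ratio_consensus` are all hypothetical. It shows that with a column-stochastic weight matrix the total "mass" is conserved at every step, and that dividing by a companion variable removes the bias that asymmetric links would otherwise introduce.

```python
def column_stochastic_weights(out_neighbors):
    """Build P where P[i][j] is the share agent j sends to agent i.

    Assumption for this sketch: each agent splits its value equally among
    its out-neighbors and itself, so every column of P sums to one.
    """
    n = len(out_neighbors)
    P = [[0.0] * n for _ in range(n)]
    for j, outs in enumerate(out_neighbors):
        targets = set(outs) | {j}  # self-loop keeps part of the mass locally
        for i in targets:
            P[i][j] = 1.0 / len(targets)
    return P


def mix(P, v):
    """One communication round: v <- P v."""
    n = len(v)
    return [sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]


def ratio_consensus(P, values, steps):
    """Run push-sum: mix a numerator y and a denominator z, return y / z."""
    y = list(values)
    z = [1.0] * len(values)
    for _ in range(steps):
        y = mix(P, y)
        z = mix(P, z)
    estimates = [yi / zi for yi, zi in zip(y, z)]
    return estimates, y, z


# Hypothetical strongly connected, asymmetric graph: 0->1, 0->2, 1->2, 2->0.
out_neighbors = [[1, 2], [2], [0]]
P = column_stochastic_weights(out_neighbors)

values = [1.0, 2.0, 6.0]  # illustrative local quantities, average is 3.0
estimates, y, z = ratio_consensus(P, values, steps=200)
```

In this sketch the sum of `y` stays equal to the sum of the initial values, which is the mass-preservation property. The individual entries of `y` do not agree across agents, because the graph is not balanced, but the ratios `y[i] / z[i]` all approach the network-wide average.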

Paper Structure (18 sections, 5 theorems, 50 equations, 4 figures, 1 algorithm)

Key Result

Lemma 1

Consider the recursion of the estimates of the cumulative rewards in eq:running_consensus_a for some arm $k\in\mathcal{K}$, with $\hat{s}(0)=\mathbf{0}_N$. Suppose the communication graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ satisfies Assumption ass:3, associated with the weight matrix $P$, and t

Figures (4)

  • Figure 1: Example of the cooperative multi-agent multi-armed bandit framework with arm-accessibility constraints. Blue circular nodes represent agents and beige rectangular nodes represent arms. Directed solid edges indicate communication links among agents, while dashed edges denote arm-accessibility relations between agents and arms.
  • Figure 2: The distributed multi-agent multi-armed bandits setup with arm-accessibility constraints. Blue circular nodes represent agents; beige rectangular nodes represent arms. Communication among the agents is shown in black directional arrows, while arm-accessibility per arm (also given in matrix $C$) is depicted with gray dashed lines.
  • Figure 3: Sum of individual agents' cumulative regret for UCB1 (no comm.) and A2C-UCB, under arm-access constraints.
  • Figure 4: Sum of individual agents' cumulative regret for A2C-UCB, UCB1 (no comm.), and zhu2025decentralized under full arm accessibility.

Theorems & Definitions (15)

  • Remark 1
  • Remark 2
  • Remark 3
  • Lemma 1: Dynamic tracking of running ratio consensus
  • proof
  • Lemma 2: Consensus tracking error
  • proof
  • Remark 4
  • Lemma 3: A2C-UCB confidence bound
  • proof
  • ...and 5 more