Cooperative Multi-Agent Graph Bandits: UCB Algorithm and Regret Analysis

Phevos Paschalidis; Runyu Zhang; Na Li

TL;DR

Multi-G-UCB, an Upper Confidence Bound (UCB)-based learning algorithm, is proposed, and its expected regret over $T$ steps is proved to be bounded by $O(\gamma N\log(T)[\sqrt{KT}+DK])$, where $D$ is the diameter of the graph $G$ and $\gamma$ is a boundedness parameter associated with the weight functions.

Abstract

In this paper, we formulate the multi-agent graph bandit problem as a multi-agent extension of the graph bandit problem introduced by Zhang, Johansson, and Li [CISS 57, 1-6 (2023)]. In our formulation, $N$ cooperative agents travel on a connected graph $G$ with $K$ nodes. Upon arrival at each node, agents observe a random reward drawn from a node-dependent probability distribution. The reward of the system is modeled as a weighted sum of the rewards the agents observe, where the weights capture some transformation of the reward associated with multiple agents sampling the same node at the same time. We propose an Upper Confidence Bound (UCB)-based learning algorithm, Multi-G-UCB, and prove that its expected regret over $T$ steps is bounded by $O(\gamma N\log(T)[\sqrt{KT} + DK])$, where $D$ is the diameter of graph $G$ and $\gamma$ a boundedness parameter associated with the weight functions. Lastly, we numerically test our algorithm by comparing it to alternative methods.
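
To make the reward model concrete, the following is a minimal sketch of how a single step's system reward might be computed. It assumes, as one reading of the "weighted sum" described above, that each occupied node $k$ contributes $f_k(c)$ times one reward sample, where $c$ is the number of agents at that node. The names `system_reward`, `satisfies_gamma_bound`, and `draw`, along with the example weight functions and mean rewards, are hypothetical and not taken from the paper.

```python
from collections import Counter
import math


def system_reward(positions, weight_fns, draw):
    """Assumed reward model: sum over occupied nodes of f_k(c) * sample_k.

    positions  -- positions[i] is the node index occupied by agent i
    weight_fns -- weight_fns[k] maps an agent count c to a weight f_k(c)
    draw       -- callable returning one reward sample for node k
    """
    counts = Counter(positions)  # number of agents at each occupied node
    return sum(weight_fns[k](c) * draw(k) for k, c in counts.items())


def satisfies_gamma_bound(f, gamma, n_agents):
    """Check f(c) <= gamma * c for every agent count c = 1, ..., n_agents."""
    return all(f(c) <= gamma * c for c in range(1, n_agents + 1))


# Hypothetical instance: K = 3 nodes, N = 3 agents.
means = [0.2, 0.5, 0.9]
weight_fns = [
    lambda c: float(c),     # rewards add up linearly
    lambda c: math.sqrt(c), # diminishing returns
    lambda c: min(c, 1),    # only the first agent counts
]
positions = [2, 2, 1]       # two agents at node 2, one at node 1

# Use the mean as a noiseless "sample" so the example is deterministic.
reward = system_reward(positions, weight_fns, lambda k: means[k])
```

All three example weight functions satisfy $f_k(c) \leq \gamma c$ with $\gamma = 1$, which is the boundedness condition the regret bound depends on. In a simulation, `draw` would instead sample from each node's reward distribution.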

Paper Structure (13 sections, 2 theorems, 41 equations, 2 figures, 2 algorithms)

Introduction
Problem Formulation
Multi-Agent Graph Bandit
Examples
Drone-Enabled Internet Access
Factory Production
Multi-Agent Graph Bandit Learning
Algorithm
Offline Planning
Initialization
Main Results
Numerical Simulations
Conclusions and Future Work

Key Result

Theorem 1

Let $T \geq 1$ be any positive integer. Given an instance of a multi-agent graph bandit problem with $N$ agents, $K$ arms each with weight function $f_k(\cdot)$ bounded such that $f_k(c) \leq \gamma c$, and a graph diameter of $D$, the expected system-wide regret of Multi-G-UCB after taking a total

Figures (2)

Figure 1: The graph $G$.
Figure 2: The average cumulative regret of each algorithm across 10 trials.

Theorems & Definitions (6)

Remark 1: Algorithmic Extension to Weighted Graphs
Remark 2: Algorithmic Extension to Directed Graphs
Theorem 1: Regret of Multi-G-UCB
Remark 3: Regret Consideration of Weighted Graphs
Lemma 1: Number of episodes
proof
