Cooperative Multi-Agent Graph Bandits: UCB Algorithm and Regret Analysis

Phevos Paschalidis, Runyu Zhang, Na Li

TL;DR

The paper proposes Multi-G-UCB, an Upper Confidence Bound (UCB)-based learning algorithm, and proves that its expected regret over $T$ steps is bounded by $O(\gamma N\log(T)[\sqrt{KT}+DK])$, where $D$ is the diameter of the graph $G$ and $\gamma$ is a boundedness parameter associated with the weight functions.
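To get a feel for how the stated bound scales, the following minimal sketch evaluates only the order term $\gamma N\log(T)[\sqrt{KT}+DK]$. The constant hidden by the $O(\cdot)$ notation is not given here, so the returned numbers are illustrative rather than actual regret values; the function names and the parameter values in the demo are assumptions made for this example.

```python
import math


def regret_bound_order(T: int, N: int, K: int, D: int, gamma: float) -> float:
    """Order term gamma * N * log(T) * (sqrt(K*T) + D*K) of the stated bound.

    The constant factor hidden by O(.) is unknown and is omitted here.
    """
    if T < 1:
        raise ValueError("T must be a positive integer")
    return gamma * N * math.log(T) * (math.sqrt(K * T) + D * K)


def per_step_order(T: int, N: int, K: int, D: int, gamma: float) -> float:
    """Order term divided by T, i.e. the implied average regret per step."""
    return regret_bound_order(T, N, K, D, gamma) / T


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    for T in (10**3, 10**4, 10**5, 10**6):
        print(T, per_step_order(T, N=3, K=10, D=4, gamma=1.0))
```

Because the order term grows like $\sqrt{T}\log(T)$, dividing by $T$ gives a quantity that shrinks as $T$ grows, which is what a sublinear regret bound implies for the average per-step regret.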

Abstract

In this paper, we formulate the multi-agent graph bandit problem as a multi-agent extension of the graph bandit problem introduced by Zhang, Johansson, and Li [CISS 57, 1-6 (2023)]. In our formulation, $N$ cooperative agents travel on a connected graph $G$ with $K$ nodes. Upon arrival at each node, agents observe a random reward drawn from a node-dependent probability distribution. The reward of the system is modeled as a weighted sum of the rewards the agents observe, where the weights capture some transformation of the reward associated with multiple agents sampling the same node at the same time. We propose an Upper Confidence Bound (UCB)-based learning algorithm, Multi-G-UCB, and prove that its expected regret over $T$ steps is bounded by $O(\gamma N\log(T)[\sqrt{KT} + DK])$, where $D$ is the diameter of the graph $G$ and $\gamma$ is a boundedness parameter associated with the weight functions. Lastly, we numerically test our algorithm by comparing it to alternative methods.
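The reward model described above can be illustrated with a small sketch. It assumes one plausible reading of the "weighted sum": the expected system reward for a placement of agents is $\sum_k f_k(c_k)\,\mu_k$, where $c_k$ is the number of agents on node $k$, $\mu_k$ is that node's mean reward, and $f_k$ is a per-node weight function satisfying $f_k(c) \leq \gamma c$. The exact form used in the paper is not reproduced in this summary, so this form, the function name, and the example weight function are all assumptions.

```python
from collections import Counter
from typing import Callable, Sequence


def expected_system_reward(
    positions: Sequence[int],
    mean_reward: Sequence[float],
    weight_fn: Sequence[Callable[[int], float]],
) -> float:
    """Expected system reward under the assumed form sum_k f_k(c_k) * mu_k.

    positions[i] is the node occupied by agent i; c_k counts agents on node k.
    Nodes with no agents contribute nothing to the sum.
    """
    counts = Counter(positions)
    return sum(weight_fn[k](c) * mean_reward[k] for k, c in counts.items())


def no_gain_from_overlap(c: int) -> float:
    """Hypothetical weight function: extra agents on a node add nothing.

    Satisfies f(c) <= gamma * c with gamma = 1.
    """
    return float(min(c, 1))


if __name__ == "__main__":
    means = [0.2, 0.5, 0.9]              # hypothetical node means
    weights = [no_gain_from_overlap] * 3
    # Two agents share node 2 and one agent sits on node 1.
    print(expected_system_reward([2, 2, 1], means, weights))
    # Spreading the agents over all three nodes instead.
    print(expected_system_reward([0, 1, 2], means, weights))
```

With this particular weight function, placing two agents on the same node earns no more than placing one there, so spreading the agents out can yield a higher system reward. Other weight functions would reward or penalize overlap differently.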

Paper Structure

This paper contains 13 sections, 2 theorems, 41 equations, 2 figures, 2 algorithms.

Key Result

Theorem 1

Let $T \geq 1$ be any positive integer. Given an instance of a multi-agent graph bandit problem with $N$ agents, $K$ arms each with weight function $f_k(\cdot)$ bounded such that $f_k(c) \leq \gamma c$, and a graph diameter of $D$, the expected system-wide regret of Multi-G-UCB after taking a total

Figures (2)

  • Figure 1: The graph $G$.
  • Figure 2: The average cumulative regret of each algorithm across 10 trials.

Theorems & Definitions (6)

  • Remark 1: Algorithmic Extension to Weighted Graphs
  • Remark 2: Algorithmic Extension to Directed Graphs
  • Theorem 1: Regret of Multi-G-UCB
  • Remark 3: Regret Consideration of Weighted Graphs
  • Lemma 1: Number of episodes
  • proof