Multi-Agent Stage-wise Conservative Linear Bandits

Amirhossein Afsharrad, Ahmadreza Moradipari, Sanjay Lall

TL;DR

The paper studies distributed stochastic linear bandits under stage-wise conservatism in a connected multi-agent network, requiring that the expected reward in every round be at least $(1-\alpha)$ times that of a known baseline policy. It introduces MA-SCLUCB, an episodic algorithm that alternates between action selection and an accelerated consensus-based information flow to estimate a global parameter (the average of local parameters) and construct distributed confidence sets and safe sets. The authors prove a high-probability regret bound of $\tilde{O}\left(\frac{d}{\sqrt{N}}\sqrt{T} \cdot \frac{\log(NT)}{\sqrt{\log(1/|\lambda_2|)}}\right)$, showing a $1/\sqrt{N}$ gain from collaboration, logarithmic communication overhead for well-connected networks, and only lower-order regret from safety constraints. Experiments corroborate the theoretical findings, illustrating how network connectivity, conservativeness level, and network size affect performance. The work demonstrates that safe, scalable distributed learning is achievable with near-optimal performance in reasonably connected networks, with implications for practical applications such as recommender systems and autonomous networks.
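To make the scaling in the stated bound concrete, the sketch below evaluates only the expression inside the $\tilde{O}(\cdot)$, reading $d$ as the dimension and $T$ as the horizon. The function name `regret_scale`, the use of natural logarithms, and the example values are assumptions for illustration; constants and any logarithmic factors hidden by $\tilde{O}$ are ignored, so the numbers are relative, not actual regret values.

```python
import math


def regret_scale(d, num_agents, horizon, lambda2_abs):
    """Evaluate (d / sqrt(N)) * sqrt(T) * log(N T) / sqrt(log(1 / |lambda_2|)).

    Hypothetical helper: it reproduces only the expression inside the
    O-tilde of the stated bound, with constants dropped.
    """
    if not 0.0 < lambda2_abs < 1.0:
        raise ValueError("|lambda_2| must lie strictly between 0 and 1")
    collaboration = d / math.sqrt(num_agents)  # the 1/sqrt(N) gain
    communication = math.log(num_agents * horizon) / math.sqrt(
        math.log(1.0 / lambda2_abs)
    )  # overhead from network connectivity
    return collaboration * math.sqrt(horizon) * communication


# Assumed example values: quadrupling the network size at fixed connectivity.
small = regret_scale(d=5, num_agents=16, horizon=10_000, lambda2_abs=0.5)
large = regret_scale(d=5, num_agents=64, horizon=10_000, lambda2_abs=0.5)
print(large / small)  # a bit above 0.5, because log(N T) grows with N
```

Quadrupling $N$ halves the $d/\sqrt{N}$ factor, while the $\log(NT)$ term grows slightly, so the ratio lands a little above one half.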

Abstract

In many real-world applications such as recommendation systems, multiple learning agents must balance exploration and exploitation while maintaining safety guarantees to avoid catastrophic failures. We study the stochastic linear bandit problem in a multi-agent networked setting where agents must satisfy stage-wise conservative constraints. A network of $N$ agents collaboratively maximizes cumulative reward while ensuring that the expected reward at every round is no less than $(1-\alpha)$ times that of a baseline policy. Each agent observes local rewards with unknown parameters, but the network optimizes for the global parameter (average of local parameters). Agents communicate only with immediate neighbors, and each communication round incurs additional regret. We propose MA-SCLUCB (Multi-Agent Stage-wise Conservative Linear UCB), an episodic algorithm alternating between action selection and consensus-building phases. We prove that MA-SCLUCB achieves regret $\tilde{O}\left(\frac{d}{\sqrt{N}}\sqrt{T}\cdot\frac{\log(NT)}{\sqrt{\log(1/|\lambda_2|)}}\right)$ with high probability, where $d$ is the dimension, $T$ is the horizon, and $|\lambda_2|$ is the network's second largest eigenvalue magnitude. Our analysis shows: (i) collaboration yields $\frac{1}{\sqrt{N}}$ improvement despite local communication, (ii) communication overhead grows only logarithmically for well-connected networks, and (iii) stage-wise safety adds only lower-order regret. Thus, distributed learning with safety guarantees achieves near-optimal performance in reasonably connected networks.
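The stage-wise constraint and the global parameter described above can be illustrated with a small sketch. It assumes the parameters are known, which is not the case in the paper's setting: there the agents must work with estimates and confidence sets, so this is only a check of the constraint itself, not MA-SCLUCB. All function names and numeric values are hypothetical.

```python
from statistics import fmean


def global_parameter(local_thetas):
    """Coordinate-wise average of the agents' local parameter vectors."""
    return [fmean(coords) for coords in zip(*local_thetas)]


def expected_reward(action, theta):
    """Linear reward model: inner product of the action and the parameter."""
    return sum(a * t for a, t in zip(action, theta))


def is_stagewise_safe(action, baseline_action, theta, alpha):
    """True if the action's expected reward is at least (1 - alpha) times the baseline's."""
    threshold = (1.0 - alpha) * expected_reward(baseline_action, theta)
    return expected_reward(action, theta) >= threshold


# Assumed toy instance: two agents, dimension 2.
theta_bar = global_parameter([[1.0, 0.0], [0.0, 1.0]])  # [0.5, 0.5]
baseline = [1.0, 1.0]  # baseline expected reward is 1.0, threshold 0.8 at alpha = 0.2

print(is_stagewise_safe([1.0, 0.8], baseline, theta_bar, alpha=0.2))  # True  (0.9 >= 0.8)
print(is_stagewise_safe([1.0, 0.0], baseline, theta_bar, alpha=0.2))  # False (0.5 <  0.8)
```

A smaller $\alpha$ raises the threshold and shrinks the set of actions that pass the check, which is the sense in which $\alpha$ sets the conservativeness level.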

Paper Structure

This paper contains 23 sections, 4 theorems, 27 equations, 4 figures, 1 table, 2 algorithms.

Key Result

Lemma 1

Let $W$ be the doubly stochastic weight matrix with second largest eigenvalue (in absolute value) $|\lambda_2| < 1$. After $q(s) = \lceil \log(2Ns) / \sqrt{2\log(1/|\lambda_2|)} \rceil$ communication rounds using the accelerated consensus protocol (Algorithm alg:mix), each agent $i$ obtains an estim
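As a rough illustration of the communication schedule in the excerpt above, the sketch below evaluates $q(s) = \lceil \log(2Ns) / \sqrt{2\log(1/|\lambda_2|)} \rceil$ for a few connectivity levels. The function name `consensus_rounds`, the use of natural logarithms, and the example values are assumptions; the consensus protocol itself is not implemented here.

```python
import math


def consensus_rounds(s, num_agents, lambda2_abs):
    """Number of communication rounds q(s) from the formula in Lemma 1.

    Hypothetical helper; assumes natural logarithms and 0 < |lambda_2| < 1.
    """
    if not 0.0 < lambda2_abs < 1.0:
        raise ValueError("|lambda_2| must lie strictly between 0 and 1")
    numerator = math.log(2 * num_agents * s)
    denominator = math.sqrt(2.0 * math.log(1.0 / lambda2_abs))
    return math.ceil(numerator / denominator)


# Assumed example: N = 10 agents, s = 5.
for lam in (0.5, 0.9, 0.99):
    print(lam, consensus_rounds(s=5, num_agents=10, lambda2_abs=lam))
```

As $|\lambda_2|$ approaches 1 (a poorly connected network) the denominator shrinks and more rounds are needed, while for fixed connectivity the count grows only logarithmically in $N$ and $s$.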

Figures (4)

  • Figure 1: Cumulative regret vs. connectivity
  • Figure 2: Expected reward and safety threshold
  • Figure 3: Parameter estimation convergence
  • Figure 4: Cumulative regret vs. $\alpha$

Theorems & Definitions (5)

  • Definition 1: Sub-Gaussian Random Variable
  • Lemma 1: Consensus Accuracy
  • Theorem 1: Distributed Confidence Sets and Safe-Set Inclusion
  • Lemma 2: Number of Conservative Episodes
  • Theorem 2: High-Probability Regret of MA-SCLUCB