Adaptive Sample Sharing for Multi Agent Linear Bandits

Hamza Cherkaoui; Merwan Barlier; Igor Colin

TL;DR

The paper addresses regret minimization in multi-agent linear bandits by introducing Bandit Adaptive Sample Sharing (BASS), a mechanism that shares samples between agents adaptively, using a Mahalanobis-distance-based similarity to balance the bias introduced by collaboration against the reduction in estimation uncertainty. It formalizes a separation test and a separation time to determine when collaboration is beneficial, and provides a theoretical analysis that includes bounds on the separation time, the cumulative pseudo-regret during collaboration, the network-level regret, and clustering-aware performance. Experiments on synthetic and real datasets show improvement over state-of-the-art methods and the ability to recover the cluster structure when one is present. Overall, BASS advances distributed bandit learning by enabling anisotropic, data-driven collaboration backed by theoretical guarantees.

Abstract

The multi-agent linear bandit setting is a well-known setting for which designing efficient collaboration between agents remains challenging. This paper studies the impact of data sharing among agents on regret minimization. Unlike most existing approaches, our contribution does not rely on any assumption about the structure of the bandit parameters. Our main result formalizes the trade-off between the bias and the uncertainty of the bandit parameter estimation for efficient collaboration. This result is the cornerstone of the Bandit Adaptive Sample Sharing (BASS) algorithm, whose efficiency over the current state-of-the-art is validated through theoretical analysis and empirical evaluations on both synthetic and real-world datasets. Furthermore, we demonstrate that, when the agents' parameters display a cluster structure, our algorithm accurately recovers it.
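The bias/uncertainty trade-off mentioned in the abstract can be illustrated with a small numerical sketch. This is not the BASS algorithm: it assumes two agents that pull the same arms, noiseless rewards, and an ordinary least-squares estimate, all chosen here only to keep the example deterministic. The names `pooled_estimate`, `theta_1`, and `theta_2` are hypothetical.

```python
import numpy as np

def pooled_estimate(X1, y1, X2, y2):
    """Least-squares estimate computed on the samples of both agents."""
    X = np.vstack([X1, X2])
    y = np.concatenate([y1, y2])
    A_c = X.T @ X                      # pooled design matrix
    theta_c = np.linalg.solve(A_c, X.T @ y)
    return theta_c, A_c

# Two agents with close but distinct parameters (illustrative values).
theta_1 = np.array([1.0, 0.0])
theta_2 = np.array([0.8, 0.2])

# Both agents pull the same unit-norm arms (assumption of this sketch).
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
y1 = X @ theta_1                       # noiseless rewards of agent 1
y2 = X @ theta_2                       # noiseless rewards of agent 2

A_1 = X.T @ X                          # design matrix of agent 1 alone
theta_c, A_c = pooled_estimate(X, y1, X, y2)

# Uncertainty: pooling doubles the design matrix, so an ellipsoid of
# fixed radius in the A-norm shrinks along every axis.
# Bias: the pooled estimate sits between the two true parameters.
bias = theta_c - theta_1
print("pooled estimate:", theta_c)
print("bias w.r.t. agent 1:", bias)
```

In this setup the pooled estimate is the midpoint of the two parameters, so sharing helps agent 1 only while its own uncertainty is larger than this fixed bias.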

Paper Structure (60 sections, 23 theorems, 106 equations, 10 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Cluster structure known in advance
Estimating the cluster structure as a graph
Multivariate approaches
Our contributions
Preliminaries
Notation
A simple two agent problem
Single agent regret minimization
Adaptive sample sharing
Collaborative setting
Separation test
Separation time
The Bandit Adaptive Sample Sharing algorithm
...and 45 more sections

Key Result

Theorem 4.1

Let $\delta \in (0, 1)$, $t > 0$, and $\bm{\theta}^* \in \mathbb{R}^d$. Let $(\bm{x}_s)_{1 \leq s \leq t}$ denote the sequence of arms pulled up to time $t$. Under the bounded-norms and sub-Gaussian-noise assumptions, the following holds with probability at least $1 - \delta$: [...], where $\bm{A}_t = \sum_{s = 1}^{t} \bm{x}_s \bm{x}_s^\top$ is the empirical design matrix.
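The quantities named in the theorem can be computed numerically. The following is a minimal sketch, assuming an ordinary least-squares estimate (the estimator is not specified in this summary) and a user-supplied `radius`, which is a hypothetical placeholder rather than the paper's confidence bound. Only the definition $\bm{A}_t = \sum_{s} \bm{x}_s \bm{x}_s^\top$ comes from the text above.

```python
import numpy as np

def design_matrix(X):
    """A_t = sum_s x_s x_s^T, with the pulled arms stacked as rows of X."""
    return X.T @ X

def least_squares_estimate(X, y):
    """Least-squares estimate of theta; assumes A_t is invertible."""
    return np.linalg.solve(design_matrix(X), X.T @ y)

def mahalanobis_norm(v, A):
    """||v||_A = sqrt(v^T A v)."""
    return float(np.sqrt(v @ A @ v))

def in_ellipsoid(theta, theta_hat, A, radius):
    """True if theta lies in the ellipsoid centred at theta_hat.

    `radius` is a hypothetical input, not the bound of Theorem 4.1.
    """
    return mahalanobis_norm(theta - theta_hat, A) <= radius

# Illustrative data: unit-norm arms and Gaussian noise (assumed values).
rng = np.random.default_rng(0)
d, t = 3, 200
theta_star = np.array([0.5, -0.2, 0.1])
X = rng.normal(size=(t, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = X @ theta_star + 0.1 * rng.normal(size=t)

A_t = design_matrix(X)
theta_hat = least_squares_estimate(X, y)
dist = mahalanobis_norm(theta_hat - theta_star, A_t)
print("distance in the A_t-norm:", dist)
```

The distance is measured in the $\bm{A}_t$-norm rather than the Euclidean norm, so directions that have been explored more often are penalized more heavily, which is what makes the confidence region an ellipsoid rather than a ball.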

Figures (10)

Figure 1: Estimation of $\bm{\theta}_1^{*}$ depicted with $\bullet$ (resp. $\bm{\theta}_{\mathrm{c}}^{*}$ with $\bullet$), $\hat{\bm{\theta}}_{1, t}$ with $\blacktriangle$ (resp. $\hat{\bm{\theta}}_{\mathrm{c}, t}$ with $\blacktriangle$), along with the corresponding confidence ellipsoid in blue (resp. in orange). The collaborative estimate has a reduced uncertainty ellipsoid $\|\hat{\bm{\theta}}_{\mathrm{c}, t} - \bm{\theta}_i^{*}\|_{\bm{A}_t}$.
Figure 2: Estimation of $\bm{\theta}_i^{*}$ depicted with $\bullet$ (resp. $\bm{\theta}_j^{*}$ with $\bullet$), $\hat{\bm{\theta}}_{i, t}$ with $\blacktriangle$ (resp. $\hat{\bm{\theta}}_{j, t}$ with $\blacktriangle$), along with the corresponding confidence ellipsoid in blue (resp. in orange). With high probability, the bias, depicted with a red dashed line, becomes detrimental when the ellipsoids are separated.
Figure 3: Comparison of the averaged evolution of the cumulative regret $R_t$ for the different synthetic environments considered.
Figure 4: Comparison of the averaged final cumulative regret $R_T$ as a function of the UCB parameter $\alpha$ for the different synthetic environments considered.
Figure 5: Overview of the Bandit Adaptive Sample Sharing (BASS) Algorithm
...and 5 more figures

Theorems & Definitions (41)

Theorem 4.1: Confidence Ellipsoid for Bandit Parameter Estimation
Lemma 5.1: Instantaneous Regret Upper Bounds
Lemma 5.2: Ellipsoid Separation Under Synchronous Pulling
Definition 5.1: $\gamma$-relaxed ellipsoid separation test function
Definition 5.2: Separation Time
Theorem 5.1: Lower Bound on the Separation Time $T_s$ Between Two Agents
Theorem 6.1: Upper Bound on the Separation Time $T_s$
Theorem 6.2: Individual Regret During the Collaboration Phase
Theorem 6.3: Regret During the Collaboration Phase
Lemma 6.1: Expected Number of Misassigned Agents
...and 31 more
