
Adaptive Sample Sharing for Multi Agent Linear Bandits

Hamza Cherkaoui, Merwan Barlier, Igor Colin

TL;DR

The paper addresses regret minimization in multi-agent linear bandits by introducing Bandit Adaptive Sample Sharing (BASS), a mechanism that adaptively shares samples based on a Mahalanobis-distance-based similarity to balance collaborative bias against uncertainty. It formalizes a separation test and the notion of separation time to determine when collaboration is beneficial, and provides a theoretical analysis that includes bounds on the separation time, the cumulative pseudo-regret during the collaboration phase, the network-level regret, and performance under a cluster structure. The approach is validated with experiments on synthetic and real-world datasets, which show improvement over state-of-the-art methods and recovery of the cluster structure when one is present. Overall, BASS advances distributed bandit learning by enabling anisotropic, data-driven collaboration backed by theoretical guarantees and empirical evidence.
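The anisotropy mentioned above can be illustrated with the Mahalanobis norm $\|\bm{u}\|_{\bm{M}} = \sqrt{\bm{u}^\top \bm{M} \bm{u}}$. The minimal Python sketch below is not BASS's similarity measure: the weighting matrix `M` is a hypothetical diagonal matrix chosen for illustration, and the helper `mahalanobis_norm` is our own. It only shows that two parameter gaps of equal Euclidean length can receive very different distances depending on the direction in which they occur.

```python
import numpy as np

def mahalanobis_norm(u: np.ndarray, M: np.ndarray) -> float:
    """Return ||u||_M = sqrt(u^T M u) for a symmetric positive-definite M."""
    return float(np.sqrt(u @ M @ u))

# Hypothetical weighting matrix (not taken from the paper): the first axis
# weighs differences 100x more than the second axis.
M = np.diag([100.0, 1.0])

theta_i = np.array([0.0, 0.0])
theta_j_a = np.array([0.1, 0.0])  # gap along the heavily weighted axis
theta_j_b = np.array([0.0, 0.1])  # same Euclidean gap along the other axis

d_a = mahalanobis_norm(theta_i - theta_j_a, M)
d_b = mahalanobis_norm(theta_i - theta_j_b, M)
print(round(d_a, 6), round(d_b, 6))  # 1.0 0.1
```

Both gaps have Euclidean length 0.1, yet their Mahalanobis distances differ by a factor of ten, which is what makes a direction-aware similarity different from an isotropic one.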

Abstract

The multi-agent linear bandit setting is well known, yet designing efficient collaboration between agents in it remains challenging. This paper studies the impact of data sharing among agents on regret minimization. Unlike most existing approaches, our contribution does not rely on any assumption about the structure of the bandit parameters. Our main result formalizes the trade-off between the bias and the uncertainty of the bandit parameter estimation for efficient collaboration. This result is the cornerstone of the Bandit Adaptive Sample Sharing (BASS) algorithm, whose advantage over the current state of the art is validated through both theoretical analysis and empirical evaluations on synthetic and real-world datasets. Furthermore, we demonstrate that, when agents' parameters display a cluster structure, our algorithm accurately recovers the clusters.

Paper Structure (60 sections, 23 theorems, 106 equations, 10 figures, 4 tables, 1 algorithm)


Key Result

Theorem 4.1

Let $\delta \in (0, 1)$, $t > 0$, and $\bm{\theta}^* \in \mathbb{R}^d$. Let $(\bm{x}_s)_{1 \leq s \leq t}$ denote the sequence of arms pulled up to time $t$. Under the bounded-norm assumption and the sub-Gaussian noise assumption, the following holds with probability at least $1 - \delta$: … where … and $\bm{A}_t = \sum_{s = 1}^{t} \bm{x}_s \bm{x}_s^\top$ is the empirical design matrix.
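The objects in Theorem 4.1 can be made concrete with a short simulation. This is a sketch under stated assumptions: arms are drawn at random rather than by a bandit policy, the noise is Gaussian (one instance of sub-Gaussian noise), the estimator is ordinary least squares, and the radius `beta` is a hypothetical constant because the theorem's exact bound is not reproduced in this summary. The code builds $\bm{A}_t = \sum_{s} \bm{x}_s \bm{x}_s^\top$ and measures the estimation error in the $\bm{A}_t$-norm, the norm used for the uncertainty ellipsoid in Figure 1. Without regularization, $\bm{A}_t$ is invertible only once the pulled arms span $\mathbb{R}^d$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 3, 200
theta_star = np.array([0.5, -0.2, 0.3])  # hypothetical true parameter

# Arms pulled up to time t (rows x_s), drawn at random for illustration,
# and noisy rewards y_s = <x_s, theta*> + Gaussian noise.
X = rng.normal(size=(t, d)) / np.sqrt(d)
y = X @ theta_star + 0.1 * rng.normal(size=t)

# Empirical design matrix A_t = sum_s x_s x_s^T (equal to X^T X).
A_t = sum(np.outer(x, x) for x in X)

# Ordinary least-squares estimate; lstsq avoids explicitly inverting A_t.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Estimation error in the A_t-norm: ||theta_hat - theta*||_{A_t}.
err = theta_hat - theta_star
err_At = float(np.sqrt(err @ A_t @ err))

beta = 1.0  # hypothetical radius, not the expression from Theorem 4.1
print(err_At, err_At <= beta)
```

Note that $\|\hat{\bm{\theta}} - \bm{\theta}^*\|_{\bm{A}_t}^2 = \|\bm{X}(\hat{\bm{\theta}} - \bm{\theta}^*)\|_2^2$, so the ellipsoid is tight in directions that have been pulled often and loose in the others.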

Figures (10)

  • Figure 1: Estimation of $\bm{\theta}_1^{*}$ depicted with $\bullet$ (resp. $\bm{\theta}_{\mathrm{c}}^{*}$ with $\bullet$), $\hat{\bm{\theta}}_{1, t}$ with $\blacktriangle$ (resp. $\hat{\bm{\theta}}_{\mathrm{c}, t}$ with $\blacktriangle$), along with the corresponding confidence ellipsoid in blue (resp. in orange). The collaborative estimate has a smaller uncertainty ellipsoid, as measured by $\|\hat{\bm{\theta}}_{\mathrm{c}, t} - \bm{\theta}_1^{*}\|_{\bm{A}_t}$.
  • Figure 2: Estimation of $\bm{\theta}_i^{*}$ depicted with $\bullet$ (resp. $\bm{\theta}_j^{*}$ with $\bullet$), $\hat{\bm{\theta}}_{i, t}$ with $\blacktriangle$ (resp. $\hat{\bm{\theta}}_{j, t}$ with $\blacktriangle$), along with the corresponding confidence ellipsoid in blue (resp. in orange). With high probability, the bias, depicted with a red dashed line, becomes detrimental when the ellipsoids are separated.
  • Figure 3: Comparison of the averaged evolution of the cumulative regret $R_t$ for the different synthetic environments considered.
  • Figure 4: Comparison of the averaged final cumulative regret $R_T$ as a function of the UCB parameter $\alpha$ for the different synthetic environments considered.
  • Figure 5: Overview of the Bandit Adaptive Sample Sharing (BASS) Algorithm
  • ...and 5 more figures
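The bias-uncertainty trade-off illustrated in Figures 1 and 2 can be checked numerically. The sketch below assumes that collaboration means fitting one least-squares estimate on the pooled samples of two agents, which is a simplification and not necessarily BASS's sharing rule; rewards are noiseless so that only the bias remains, and both parameters are hypothetical. In that case the pooled estimate equals $(\bm{A}_1 + \bm{A}_2)^{-1}(\bm{A}_1 \bm{\theta}_1^* + \bm{A}_2 \bm{\theta}_2^*)$, so its bias with respect to $\bm{\theta}_1^*$ is $(\bm{A}_1 + \bm{A}_2)^{-1} \bm{A}_2 (\bm{\theta}_2^* - \bm{\theta}_1^*)$, while the ellipsoid volume, proportional to $\det(\bm{A})^{-1/2}$, shrinks because $\bm{A}_1 + \bm{A}_2 \succeq \bm{A}_1$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 50
theta_1 = np.array([1.0, 0.0])  # agent 1 (hypothetical)
theta_2 = np.array([0.9, 0.3])  # similar but different agent (hypothetical)

# Each agent's own arms and design matrix A_i = X_i^T X_i.
X1 = rng.normal(size=(n, d))
X2 = rng.normal(size=(n, d))
A1, A2 = X1.T @ X1, X2.T @ X2

# Noiseless rewards isolate the bias: b_i = X_i^T y_i = A_i theta_i.
b1, b2 = A1 @ theta_1, A2 @ theta_2

theta_hat_1 = np.linalg.solve(A1, b1)            # individual estimate
theta_hat_c = np.linalg.solve(A1 + A2, b1 + b2)  # pooled ("collaborative") estimate

# Closed-form bias of the pooled estimate with respect to agent 1's parameter.
bias = np.linalg.solve(A1 + A2, A2 @ (theta_2 - theta_1))

# Ellipsoid volume is proportional to det(A)^{-1/2} (same radius assumed).
vol_1 = 1.0 / np.sqrt(np.linalg.det(A1))
vol_c = 1.0 / np.sqrt(np.linalg.det(A1 + A2))

print("bias:", bias, "volume ratio:", vol_c / vol_1)
```

Pooling roughly halves the determinant-based volume here, but the pooled estimate is pulled toward $\bm{\theta}_2^*$; once that pull exceeds the remaining uncertainty, sharing samples hurts agent 1.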

Theorems & Definitions (41)

  • Theorem 4.1: Confidence Ellipsoid for Bandit Parameter Estimation
  • Lemma 5.1: Instantaneous Regret Upper Bounds
  • Lemma 5.2: Ellipsoid Separation Under Synchronous Pulling
  • Definition 5.1: $\gamma$-relaxed ellipsoid separation test function
  • Definition 5.2: Separation Time
  • Theorem 5.1: Lower Bound on the Separation Time ($T_s$) Between Two Agents
  • Theorem 6.1: Upper Bound on the Separation Time ($T_s$)
  • Theorem 6.2: Individual Regret During the Collaboration Phase
  • Theorem 6.3: Regret During the Collaboration Phase
  • Lemma 6.1: Expected Number of Misassigned Agents
  • ...and 31 more
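The separation test and separation time listed above can be sketched under simplifying assumptions. We assume that synchronous pulling means both agents pull the same arm each round, so they share one design matrix $\bm{A}_t$, and we use a fixed hypothetical radius `beta`; neither the radius of Theorem 4.1 nor the $\gamma$ relaxation of Definition 5.1 is reproduced here. With a shared positive-definite matrix and a common radius, the two confidence ellipsoids are balls of the norm $\|\cdot\|_{\bm{A}_t}$, so they are disjoint exactly when $\|\hat{\bm{\theta}}_{i,t} - \hat{\bm{\theta}}_{j,t}\|_{\bm{A}_t} > 2\beta$. The helpers `ellipsoids_separated` and `separation_time` are our own illustrative names, and the "separation time" they return is simply the first round at which this test fires.

```python
import numpy as np

def ellipsoids_separated(theta_i, theta_j, A, beta):
    """True if {θ: ||θ-θ_i||_A <= β} and {θ: ||θ-θ_j||_A <= β} are disjoint.

    With a shared positive-definite A and a common radius β, both sets are
    balls of the norm ||.||_A, so they are disjoint iff the centers are more
    than 2β apart in that norm.
    """
    diff = theta_i - theta_j
    return float(np.sqrt(diff @ A @ diff)) > 2.0 * beta

def separation_time(theta_i_star, theta_j_star, horizon, beta, sigma=0.1, seed=0):
    """First round at which the test fires, or None (hypothetical simulation)."""
    rng = np.random.default_rng(seed)
    d = theta_i_star.shape[0]
    A = np.zeros((d, d))
    b_i, b_j = np.zeros(d), np.zeros(d)
    for t in range(1, horizon + 1):
        x = rng.normal(size=d) / np.sqrt(d)  # both agents pull the same arm
        A += np.outer(x, x)
        b_i += x * (x @ theta_i_star + sigma * rng.normal())
        b_j += x * (x @ theta_j_star + sigma * rng.normal())
        if t < d:
            continue  # A is singular until d arms have been pulled
        theta_i_hat = np.linalg.solve(A, b_i)
        theta_j_hat = np.linalg.solve(A, b_j)
        if ellipsoids_separated(theta_i_hat, theta_j_hat, A, beta):
            return t
    return None
```

For two agents whose parameters differ by 0.2 along one axis, `separation_time(np.array([1.0, 0.0]), np.array([0.8, 0.0]), horizon=5000, beta=1.0)` should return a finite round, whereas two identical agents should not separate within the horizon. Smaller gaps take longer to detect; the separation time is the quantity that Theorems 5.1 and 6.1 bound from below and above.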