Decentralized Asynchronous Multi-player Bandits

Jingqi Fan, Canzhe Zhao, Shuai Li, Siwei Wang

TL;DR

Decentralized asynchronous MP-MAB problems arise when no global clock exists and players can join or leave at arbitrary times. The authors propose ACE, an adaptive exploration-exploitation algorithm in which each player $j$ maintains an occupied-arm set $\mathcal{A}^j$ and two queues $\mathcal{P}^j_k$ and $\mathcal{Q}^j_k$, and alternates between phases to minimize collisions while tracking arm availability. Exploration pulls uniformly from $[K]\setminus \mathcal{A}^j$, while exploitation mostly repeats a chosen arm and, with probability $\varepsilon$, probes an arm in $\mathcal{A}^j$. They prove a regret bound $R(T) = \mathcal{O}(\sqrt{T \log T} + \log T / \Delta^2)$, which is sublinear in $T$ even in highly asynchronous, decentralized environments. Experiments on Gaussian rewards with varying $K$ and $M$ demonstrate ACE's robustness and scalability, outperforming or matching synchronized baselines without requiring a global clock or exact lower bounds.
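
The arm-selection rule summarized above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_arm`, the `phase` strings, and the fallback used when every arm is believed occupied are hypothetical, and the queue updates and phase-switching logic of ACE are not modeled.

```python
import random


def select_arm(num_arms, occupied, phase, current_arm=None, eps=0.1, rng=random):
    """Pick an arm index in range(num_arms) for one player in one round.

    occupied    -- set of arm indices the player believes are held by others
                   (plays the role of A^j in the summary)
    phase       -- "explore" or "exploit" (hypothetical labels)
    current_arm -- the arm being repeated during exploitation
    eps         -- probability of probing an occupied arm while exploiting
    """
    if phase == "explore":
        # Uniform pull from [K] \ A^j.
        free = [k for k in range(num_arms) if k not in occupied]
        # Assumption: if no arm looks free, fall back to all arms.
        return rng.choice(free if free else list(range(num_arms)))
    if phase == "exploit":
        # With probability eps, probe an arm believed to be occupied,
        # which is how a departure could be noticed.
        if occupied and rng.random() < eps:
            return rng.choice(sorted(occupied))
        # Otherwise keep pulling the chosen arm.
        return current_arm
    raise ValueError(f"unknown phase: {phase!r}")
```

With `eps=0.0` an exploiting player never probes and always repeats its arm. A larger `eps` would presumably notice freed arms sooner, at the price of more pulls on arms that other players may still hold.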

Abstract

In recent years, multi-player multi-armed bandits (MP-MAB) have been extensively studied due to their wide applications in cognitive radio networks and Internet of Things systems. While most existing research on MP-MAB focuses on synchronized settings, real-world systems are often decentralized and asynchronous, where players may enter or leave the system at arbitrary times, and do not have a global clock. This decentralized asynchronous setting introduces two major challenges. First, without a global time, players cannot implicitly coordinate their actions through time, making it difficult to avoid collisions. Second, it is important to detect how many players are in the system, but doing so may cost a lot. In this paper, we address the challenges posed by such a fully asynchronous setting in a decentralized environment. We develop a novel algorithm in which players adaptively change between exploration and exploitation. During exploration, players uniformly pull their arms, reducing the probability of collisions and effectively mitigating the first challenge. Meanwhile, players continue pulling arms currently exploited by others with a small probability, enabling them to detect when a player has left, thereby addressing the second challenge. We prove that our algorithm achieves a regret of $\mathcal{O}(\sqrt{T \log T} + \log T/\Delta^2)$, where $\Delta$ is the minimum expected reward gap between any two arms. To the best of our knowledge, this is the first efficient MP-MAB algorithm in the asynchronous and decentralized environment. Extensive experiments further validate the effectiveness and robustness of our algorithm, demonstrating its applicability to real-world scenarios.

Paper Structure

This paper contains 16 sections, 17 theorems, 20 equations, 4 figures, 4 tables, 1 algorithm.

Key Result

theorem 1

Let $\varepsilon = \min\{ \sqrt{\frac{1141m^3\ln(T) }{2T}}, \frac{1}{K}, \frac{1}{10} \}$. Then, given $K$ arms and $M$ players, the regret of Algorithm 1 is bounded by $R(T) = \mathcal{O}(\sqrt{T \log T} + \log T/\Delta^2)$, where $\Delta := \min_{k\leq m}(\mu_k- \mu_{k+1})$.
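
As a worked illustration of the choice of $\varepsilon$ in the theorem, the following Python sketch evaluates the three-way minimum directly. The function name `probing_probability` and the input checks are assumptions made for illustration; `m` is used exactly as it appears in the theorem statement, which this excerpt does not define further.

```python
import math


def probing_probability(T, m, K):
    """Evaluate eps = min{ sqrt(1141 m^3 ln(T) / (2T)), 1/K, 1/10 }.

    T -- time horizon (must exceed 1 so that ln(T) > 0)
    m -- the quantity m from the theorem statement
    K -- number of arms
    """
    # Hypothetical sanity checks; the theorem itself states no such guards.
    if T <= 1 or m < 1 or K < 1:
        raise ValueError("expected T > 1, m >= 1, K >= 1")
    horizon_term = math.sqrt(1141 * m**3 * math.log(T) / (2 * T))
    return min(horizon_term, 1 / K, 1 / 10)
```

Because of the constant 1141, the horizon-dependent term is large for moderate $T$: with `T = 10**4`, `m = 2`, `K = 5` it is roughly 2.05, so the cap of $1/10$ applies. The square-root term only becomes the binding one for much longer horizons.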

Figures (4)

  • Figure 1: Comparison of cumulative regret for different numbers of arms $K$ under different asynchronous settings.
  • Figure 2: Comparison of cumulative regret between UCB with multiple parameters and ACE for different $K$ under different asynchronous settings.
  • Figure 3: Comparison of cumulative regret for different numbers of players $M$ under different asynchronous settings.
  • Figure 4: Comparison of cumulative regret between UCB with multiple parameters and ACE for different $M$ under different asynchronous settings.

Theorems & Definitions (20)

  • remark 1
  • remark 2
  • theorem 1
  • remark 3
  • lemma 4
  • lemma 5
  • lemma 6
  • lemma 7
  • lemma 8
  • lemma 9
  • ...and 10 more