Competitive Multi-armed Bandit Games for Resource Sharing

Hongbo Li, Lingjie Duan

TL;DR

This work studies N-player competitive MAB (CMAB) games for resource sharing under unknown Bernoulli rewards with collisions. It develops threshold-based policies for both selfish and socially optimal play, reveals that selfish behavior can cause an infinite price of anarchy ($PoA$), and shows that informational mechanisms alone cannot fix this when players are non-myopic. It then introduces the Combined Informational and Side-Payment (CISP) mechanism, which yields $PoA=1$ by aligning incentives through information sharing and monetary transfers while preserving budget balance. Experiments corroborate the theory, demonstrating that CISP matches the convergence pace of the social optimum and eliminates the inefficiency observed under selfish behavior and information hiding.

Abstract

In modern resource-sharing systems, multiple agents access limited resources with unknown stochastic conditions to perform tasks. When multiple agents access the same resource (arm) simultaneously, they compete for successful usage, leading to contention and reduced rewards. This motivates our study of competitive multi-armed bandit (CMAB) games. In this paper, we study a new N-player K-arm competitive MAB game, where non-myopic players (agents) compete with each other to form diverse private estimations of unknown arms over time. Their possible collisions on the same arms and the time-varying nature of arm rewards make the policy analysis more involved than in existing studies of myopic players. We explicitly analyze the threshold-based structures of the social optimum and the existing selfish policy, showing that the latter causes a prolonged convergence time of $\Omega(\frac{K}{\eta^2}\ln(\frac{KN}{\delta}))$, while the socially optimal policy with coordinated communication reduces it to $\mathcal{O}(\frac{K}{N\eta^2}\ln(\frac{K}{\delta}))$. Based on this comparison, we prove that the competition among selfish players for the best arm can result in an infinite price of anarchy (PoA), indicating an arbitrarily large efficiency loss compared to the social optimum. We further prove that no informational (non-monetary) mechanism (including Bayesian persuasion) can reduce the infinite PoA, as strategic misreporting by non-myopic players undermines such approaches. To address this, we propose a Combined Informational and Side-Payment (CISP) mechanism, which provides socially optimal arm recommendations with proper informational and monetary incentives to players according to their time-varying private beliefs. Our CISP mechanism remains ex-post budget balanced for the social planner and ensures truthful reporting from players, achieving the minimum PoA=1 and the same convergence time as the social optimum.
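
To get a feel for the gap between the two convergence-time expressions in the abstract, the following minimal sketch evaluates only their growth terms. It drops the constants hidden by the $\Omega$ and $\mathcal{O}$ notation, so the outputs are not actual convergence times, and comparing a lower bound with an upper bound this way is only indicative. The function names and the sample values of $K$, $N$, $\eta$ and $\delta$ are hypothetical, chosen for illustration.

```python
import math


def selfish_growth_term(K, N, eta, delta):
    """Growth term of the selfish-policy bound: (K / eta^2) * ln(K * N / delta).

    Constants hidden by the Omega notation are ignored (assumption).
    """
    return K / eta**2 * math.log(K * N / delta)


def social_growth_term(K, N, eta, delta):
    """Growth term of the social-optimum bound: (K / (N * eta^2)) * ln(K / delta).

    Constants hidden by the O notation are ignored (assumption).
    """
    return K / (N * eta**2) * math.log(K / delta)


# Hypothetical sample values, for illustration only.
K, N, eta, delta = 10, 5, 0.1, 0.05
ratio = selfish_growth_term(K, N, eta, delta) / social_growth_term(K, N, eta, delta)
print(f"ratio of growth terms: {ratio:.2f}")
```

Algebraically the ratio of the two terms is $N \cdot \ln(KN/\delta) / \ln(K/\delta)$, so under these expressions it exceeds $N$ whenever $N > 1$ and $K/\delta > 1$.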

Paper Structure

This paper contains 22 sections, 11 theorems, 54 equations, 4 figures, 1 table, 3 algorithms.

Key Result

Lemma 1

For player $n$ with empirical mean reward set $\tilde{\bm{\mu}}^n(t)$ at $t$, we have $\mathbb{E}[\tilde{\bm{\mu}}^n(t+1)]=\tilde{\bm{\mu}}^n(t)$.

Figures (4)

  • Figure 1: The process of arm pulling and reward observations when $|\mathbb{N}_k(t)|$ players simultaneously choose arm $k$. In this case, only one player is randomly selected with probability $\frac{1}{|\mathbb{N}_k(t)|}$ to pull arm $k$ (e.g., player $n$ with $\sigma_n(t)=0$ in the figure) and receive a reward of $r_k^n(t)=r_k(t)$. As in [boursier2019sic, wang2020optimal], the remaining $|\mathbb{N}_k(t)|-1$ players observe collisions involving $|\mathbb{N}_k(t)|$ players there and receive zero rewards (with $\sigma_{n'}(t)=1$). In other words, these $|\mathbb{N}_k(t)|-1$ players have no effective reward observation of this arm.
  • Figure 2: Comparison of $N=5$ players' average empirical mean rewards of arm 1 under the selfish policy.
  • Figure 3: Comparison of average learning errors under the selfish policy, the information-hiding mechanism [tavafoghi2017informational, li2023congestion], our CISP mechanism, and the socially optimal policy.
  • Figure 4: Comparison of average inefficiency ratios caused by selfish policy, information-hiding, and our CISP mechanism. We vary the number of players $N\in\{2,4,6,8,10\}$.
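
The collision model described in the Figure 1 caption can be sketched as a single round of play. This is a minimal illustration, not the paper's implementation: it assumes each arm pays a Bernoulli reward with a fixed mean `mu[k]` (the paper's arm rewards are time-varying), and the function name `play_round` and its argument names are hypothetical.

```python
import random


def play_round(choices, mu, rng):
    """Simulate one round of the collision model.

    choices: choices[n] is the arm index picked by player n.
    mu: mu[k] is the Bernoulli mean of arm k (assumed fixed in this sketch).
    rng: a random.Random instance.
    Returns (rewards, collided), where collided[n] plays the role of sigma_n(t).
    """
    n_players = len(choices)
    rewards = [0] * n_players
    collided = [0] * n_players

    # Group the players by the arm they chose, i.e. build each N_k(t).
    players_on_arm = {}
    for n, k in enumerate(choices):
        players_on_arm.setdefault(k, []).append(n)

    for k, players in players_on_arm.items():
        # Exactly one player is selected uniformly, with probability 1/|N_k(t)|.
        winner = rng.choice(players)
        arm_reward = 1 if rng.random() < mu[k] else 0
        for n in players:
            if n == winner:
                rewards[n] = arm_reward  # r_k^n(t) = r_k(t)
            else:
                collided[n] = 1  # collision: zero reward, no observation
    return rewards, collided


rng = random.Random(0)
rewards, collided = play_round([0, 0, 0, 1], [0.5, 0.9], rng)
print(rewards, collided)
```

In this call three players choose arm 0, so two of them are marked as collided, while the single player on arm 1 always pulls it successfully.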

Theorems & Definitions (17)

  • Definition 1: Selfish policy
  • Lemma 1
  • Proposition 1
  • Definition 2: $\epsilon$-Nash Equilibrium ($\epsilon$-NE)
  • Definition 3: Convergence
  • Proposition 2
  • Lemma 2
  • Proposition 3
  • Proposition 4
  • Theorem 1
  • ...and 7 more