Fair Multi-Agent Bandits

Amir Leshem

TL;DR

This work addresses fair resource allocation in a distributed, collision-based multi-agent bandit setting, where agents cannot communicate beyond collision signals. It introduces a three-phase framework (agent ordering, exploration, and distributed matching via a collision-enabled auction) to learn a max-min fair allocation and then exploit it. The paper proves a near-optimal regret bound of $\mathrm{Reg}(T) = O\left(N^3 \log \frac{B}{\Delta} f(\log T) \log T\right)$ for bounded rewards with an unknown bound, and shows polynomial scaling in the number of agents $N$, where prior results had exponential dependence on $N$. Simulations demonstrate improved regret and faster convergence over prior fair-bandit methods, highlighting the approach’s practicality for large-scale distributed systems and suggesting extensions to Pareto-dominant allocations via weighting.
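To make the target of the learning procedure concrete, here is a minimal centralized sketch of what a max-min fair allocation is, assuming the matrix of expected rewards is already known. It is not the paper's distributed, collision-based auction: the function names `perfect_matching` and `max_min_fair_allocation`, the threshold scan, and the use of Kuhn's augmenting-path matching are all illustrative choices made here.

```python
from typing import List, Optional, Tuple


def perfect_matching(allowed: List[List[bool]]) -> Optional[List[int]]:
    """Kuhn's augmenting-path algorithm.

    allowed[n][k] is True if agent n may be assigned arm k.
    Returns one arm index per agent, or None if some agent cannot be matched.
    """
    n_agents = len(allowed)
    n_arms = len(allowed[0])
    arm_owner = [-1] * n_arms  # arm_owner[k] = agent currently holding arm k

    def try_assign(agent: int, seen: List[bool]) -> bool:
        for arm in range(n_arms):
            if allowed[agent][arm] and not seen[arm]:
                seen[arm] = True
                # Take a free arm, or push the current owner to another arm.
                if arm_owner[arm] == -1 or try_assign(arm_owner[arm], seen):
                    arm_owner[arm] = agent
                    return True
        return False

    for agent in range(n_agents):
        if not try_assign(agent, [False] * n_arms):
            return None

    assignment = [-1] * n_agents
    for arm, agent in enumerate(arm_owner):
        if agent != -1:
            assignment[agent] = arm
    return assignment


def max_min_fair_allocation(means: List[List[float]]) -> Tuple[float, List[int]]:
    """Return (max-min value, arm per agent) for a known reward matrix.

    Scans candidate thresholds from largest to smallest; the first threshold
    for which every agent can get an arm with mean >= threshold is the
    max-min value.
    """
    thresholds = sorted({m for row in means for m in row}, reverse=True)
    for gamma in thresholds:
        allowed = [[m >= gamma for m in row] for row in means]
        assignment = perfect_matching(allowed)
        if assignment is not None:
            return gamma, assignment
    raise ValueError("no collision-free allocation: need at least as many arms as agents")


# Hypothetical 3-agent, 3-arm example (rows are agents, columns are arms).
example_means = [
    [0.9, 0.4, 0.1],
    [0.8, 0.3, 0.2],
    [0.7, 0.6, 0.5],
]
value, arms = max_min_fair_allocation(example_means)
print(value, arms)
```

In this example the allocation that maximizes the total reward would give agent 0 its best arm, but the max-min fair allocation instead raises the reward of the worst-off agent, which is the quantity the fairness criterion targets.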

Abstract

In this paper, we study the problem of fair multi-agent multi-armed bandit learning when agents do not communicate with each other, except for collision information, which is provided to agents accessing the same arm simultaneously. We provide an algorithm with regret $O\left(N^3 \log \frac{B}{\Delta} f(\log T) \log T \right)$ (assuming bounded rewards, with unknown bound), where $f(t)$ is any function diverging to infinity with $t$. This significantly improves on previous results, which had the same $O(f(\log T) \log T)$ dependence on $T$ but an exponential dependence on the number of agents. The result is attained by using a distributed auction algorithm to learn the sample-optimal matching and a novel order-statistics-based regret analysis. Simulation results present the dependence of the regret on $\log T$.
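As a worked instance of how the free function $f$ enters the bound, one admissible choice is $f(t) = \log t$, which diverges to infinity with $t$. This choice is an assumption made here for illustration only; the abstract does not single out any particular $f$. Substituting it into the stated bound gives:

```latex
\[
f(t) = \log t
\quad\Longrightarrow\quad
\mathrm{Reg}(T) = O\!\left(N^3 \,\log\frac{B}{\Delta}\,\log\log T \,\log T\right).
\]
```

The slower $f$ grows, the closer the dependence on $T$ gets to $\log T$, while the dependence on the number of agents stays cubic.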

Paper Structure

This paper contains 20 sections, 5 theorems, 24 equations, 3 figures, 5 algorithms.

Key Result

Lemma 3.1

Let $x_1,\ldots,x_L$ be i.i.d. random variables with cumulative distribution function $F(x)$ supported on $[0,B]$. Then

Figures (3)

  • Figure 1: Convergence of the algorithm in the setup of Bistritz et al. (2021). $L=50$, $c_3(k)=1.8^k$.
  • Figure 2: Evolution of the regret vs. time.
  • Figure 3: Evolution of the regret vs. number of agents.

Theorems & Definitions (7)

  • Remark 2.1: Bounded rewards
  • Remark 2.2: Continuously distributed rewards
  • Lemma 3.1
  • Lemma 3.2
  • Theorem 4.1
  • Lemma 4.2
  • Lemma 4.3