Fair Multi-Agent Bandits

Amir Leshem

TL;DR

This work addresses fair resource allocation in a distributed, collision-based multi-agent bandit setting, where agents cannot communicate beyond collision signals. It introduces a three-phase framework—agent ordering, exploration, and distributed matching via a collision-enabled auction—to learn a max-min fair allocation and then exploit it. The authors prove a near-optimal regret bound of $Reg(T) = O\left(N^3 \log \frac{B}{\Delta} f(\log T) \log T\right)$ with unknown reward bounds, and show polynomial (in $N$) scalability where prior results had exponential dependence on the number of agents. Simulations demonstrate improved regret and faster convergence over prior fair-bandit methods, highlighting the approach’s practicality for large-scale distributed systems and suggesting extensions to Pareto-dominant allocations via weighting.
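To make the target of the learning procedure concrete, the following is a minimal centralized sketch of a max-min fair allocation. It assumes the objective is to maximize the smallest expected reward over collision-free assignments of agents to arms. It is not the distributed, collision-based auction described above: it uses brute-force search over known means, and the function name and the reward matrix are hypothetical, chosen only for illustration.

```python
from itertools import permutations


def max_min_fair_allocation(means):
    """Brute-force max-min fair assignment (illustrative, centralized).

    means[n][k] is the assumed expected reward of agent n on arm k,
    with at least as many arms as agents. Returns (value, assignment),
    where assignment[n] is the arm given to agent n.
    """
    n_agents = len(means)
    n_arms = len(means[0])
    best_value, best_assignment = float("-inf"), None
    # Each permutation gives every agent a distinct arm (no collisions).
    for arms in permutations(range(n_arms), n_agents):
        value = min(means[n][arms[n]] for n in range(n_agents))
        if value > best_value:
            best_value, best_assignment = value, arms
    return best_value, best_assignment


# Hypothetical expected rewards for 3 agents and 3 arms.
means = [
    [0.9, 0.5, 0.1],
    [0.8, 0.2, 0.3],
    [0.7, 0.6, 0.4],
]
value, assignment = max_min_fair_allocation(means)
print(value, assignment)
```

In this toy instance the sum-maximizing assignment would be (0, 2, 1), which leaves the worst-off agent with 0.3, while the max-min fair assignment raises the worst-off agent to 0.4. The brute-force search is factorial in the number of agents and serves only to show the objective.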

Abstract

In this paper, we study the problem of fair multi-agent multi-armed bandit learning when agents do not communicate with each other, except for collision information, which is provided to agents accessing the same arm simultaneously. We provide an algorithm with regret $O\left(N^3 \log \frac{B}{\Delta} f(\log T) \log T \right)$ (assuming bounded rewards, with unknown bound), where $f(t)$ is any function diverging to infinity with $t$. This significantly improves on previous results, which had the same upper bound on the regret of order $O(f(\log T) \log T)$ but an exponential dependence on the number of agents. The result is attained by using a distributed auction algorithm to learn the sample-optimal matching and a novel order-statistics-based regret analysis. Simulation results present the dependence of the regret on $\log T$.
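As a rough illustration of how the stated bound scales, the sketch below evaluates the expression inside the $O(\cdot)$ with all hidden constants dropped. Several things here are assumptions rather than part of the source: the choice $f(t) = \log(1+t)$ is just one example of a function diverging to infinity, $\Delta$ is treated as a positive gap parameter smaller than $B$, and the function name is hypothetical. The numbers are not regret values, only the shape of the bound.

```python
import math


def regret_bound_shape(n_agents, horizon, reward_bound, gap,
                       f=lambda t: math.log(1.0 + t)):
    """Evaluate N^3 * log(B / Delta) * f(log T) * log T, constants dropped.

    f is an assumed example of a slowly diverging function; any function
    diverging to infinity is allowed by the stated result.
    """
    log_t = math.log(horizon)
    return n_agents ** 3 * math.log(reward_bound / gap) * f(log_t) * log_t


# Doubling the number of agents multiplies the expression by 2^3 = 8.
small = regret_bound_shape(10, 10 ** 6, reward_bound=1.0, gap=0.1)
large = regret_bound_shape(20, 10 ** 6, reward_bound=1.0, gap=0.1)
print(large / small)
```

The dependence on the number of agents is cubic in this expression, in contrast to the exponential dependence attributed to earlier results, while the dependence on the horizon stays at $f(\log T) \log T$.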

Paper Structure (20 sections, 5 theorems, 24 equations, 3 figures, 5 algorithms)

Introduction
Prior Work on fair bandit learning
Contributions and limitations
The max-min fair bandit problem
Learning a max-min optimal allocation
Agent's ordering
Exploration Phase
Matching
Exploitation
Regret analysis
Probability of matching error
Simulations
Conclusion
Simulations for various number of agents
Proof of Lemma \ref{lem:exploration_error}
...and 5 more sections

Key Result

Lemma 3.1

Let $x_1,\ldots,x_L$ be i.i.d. random variables with cumulative distribution function $F(x)$ with support $[0,B]$. Then

Figures (3)

Figure 1: Convergence of the algorithm in the setup of bistritz2021one. $L=50, c_3(k)=1.8^k$.
Figure 2: Evolution of the regret vs. time.
Figure 3: Evolution of the regret vs. number of agents.

Theorems & Definitions (7)

Remark 2.1: Bounded rewards
Remark 2.2: Continuously distributed rewards
Lemma 3.1
Lemma 3.2
Theorem 4.1
Lemma 4.2
Lemma 4.3
