Fair Multi-Agent Bandits
Amir Leshem
TL;DR
This work addresses fair resource allocation in a distributed, collision-based multi-agent bandit setting, where agents cannot communicate beyond collision signals. It introduces a three-phase framework (agent ordering, exploration, and distributed matching via a collision-enabled auction) that learns a max-min fair allocation and then exploits it. The authors prove a near-optimal regret bound of $Reg(T) = O\left(N^3 \log \frac{B}{\Delta} f(\log T) \log T\right)$ for bounded rewards with an unknown bound, where $f$ is any function diverging to infinity. The dependence on the number of agents $N$ is polynomial, whereas prior results had exponential dependence. Simulations demonstrate improved regret and faster convergence over prior fair-bandit methods, highlighting the approach's practicality for large-scale distributed systems and suggesting extensions to Pareto-dominant allocations via weighting.
Abstract
In this paper, we study the problem of fair multi-agent multi-armed bandit learning when agents do not communicate with each other, except for collision information provided to agents accessing the same arm simultaneously. We provide an algorithm with regret $O\left(N^3 \log \frac{B}{\Delta} f(\log T) \log T \right)$ (assuming bounded rewards, with unknown bound), where $f(t)$ is any function diverging to infinity with $t$. This significantly improves on previous results, which had the same $O(f(\log T) \log T)$ dependence on $T$ but an exponential dependence on the number of agents. The result is attained by using a distributed auction algorithm to learn the sample-optimal matching and a novel order-statistics-based regret analysis. Simulation results illustrate the dependence of the regret on $\log T$.
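To make the fairness objective concrete, the following minimal sketch computes a max-min fair matching of agents to arms by brute force, assuming the expected rewards are known. It is a centralized illustration only and is not the distributed auction used in the paper. The function name `max_min_fair_matching` and the reward matrix `means` are hypothetical, chosen for this example.

```python
from itertools import permutations


def max_min_fair_matching(means):
    """Return the collision-free assignment maximizing the minimum reward.

    means[n][k] is the (assumed known) expected reward of agent n on arm k.
    Brute force over all assignments: exponential cost, illustration only.
    """
    n_agents = len(means)
    n_arms = len(means[0])
    best_value, best_assignment = float("-inf"), None
    # Each permutation gives every agent a distinct arm, so there are no collisions.
    for arms in permutations(range(n_arms), n_agents):
        # The fairness value of an assignment is the reward of its worst-off agent.
        value = min(means[agent][arm] for agent, arm in enumerate(arms))
        if value > best_value:
            best_value, best_assignment = value, arms
    return best_assignment, best_value


# Hypothetical expected rewards for 3 agents (rows) and 3 arms (columns).
means = [
    [0.9, 0.5, 0.1],
    [0.8, 0.2, 0.3],
    [0.7, 0.6, 0.4],
]
assignment, value = max_min_fair_matching(means)
print(assignment, value)  # (1, 0, 2) 0.4
```

In this example the sum-maximizing assignment is `(0, 2, 1)`, with total reward 1.8 but a worst-off agent receiving only 0.3. The max-min fair assignment `(1, 0, 2)` raises the worst-off agent's reward to 0.4 at the cost of a lower total of 1.7.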
