No-Regret Learning for Fair Multi-Agent Social Welfare Optimization

Mengxiao Zhang; Ramiro Deo-Campo Vuong; Haipeng Luo

TL;DR

This work studies online learning for multi-agent Nash social welfare (NSW) maximization, using NSW as the fairness-aware objective, in both stochastic and adversarial environments under bandit and full-information feedback. For the stochastic bandit setting, it develops a UCB algorithm with a Bernstein-type confidence set that achieves $\widetilde{\mathcal{O}}(K^{2/N} T^{(N-1)/N})$ regret, together with a lower bound showing that the dependence on $T$ is tight. This stands in stark contrast to the $\widetilde{\mathcal{O}}(\sqrt{T})$ regret known for the product-form objective NSW_prod. In adversarial settings with bandit feedback, the authors prove that no algorithm can achieve sublinear regret. With full-information feedback, they design two Follow-the-Regularized-Leader (FTRL) algorithms with $\sqrt{T}$-type regret guarantees: one uses a log-barrier regularizer and has a bound independent of $N$, and the other uses Tsallis entropy and has better dependence on $K$, which makes it preferable when $N$ is small. The paper also shows a special case, based on exp-concavity, in which logarithmic regret is possible when at least one agent is indifferent among the arms. Overall, the results reveal fundamental differences between NSW and NSW_prod and show that richer feedback makes the adversarial problem learnable, with implications for fair online resource allocation in dynamic multi-agent settings.

Abstract

We consider the problem of online multi-agent Nash social welfare (NSW) maximization. While previous works of Hossain et al. [2021], Jones et al. [2023] study similar problems in stochastic multi-agent multi-armed bandits and show that $\sqrt{T}$-regret is possible after $T$ rounds, their fairness measure is the product of all agents' rewards, instead of their NSW (that is, their geometric mean). Given the fundamental role of NSW in the fairness literature, it is more than natural to ask whether no-regret fair learning with NSW as the objective is possible. In this work, we provide a complete answer to this question in various settings. Specifically, in stochastic $N$-agent $K$-armed bandits, we develop an algorithm with $\widetilde{\mathcal{O}}\left(K^{\frac{2}{N}}T^{\frac{N-1}{N}}\right)$ regret and prove that the dependence on $T$ is tight, making it a sharp contrast to the $\sqrt{T}$-regret bounds of Hossain et al. [2021], Jones et al. [2023]. We then consider a more challenging version of the problem with adversarial rewards. Somewhat surprisingly, despite NSW being a concave function, we prove that no algorithm can achieve sublinear regret. To circumvent such negative results, we further consider a setting with full-information feedback and design two algorithms with $\sqrt{T}$-regret: the first one has no dependence on $N$ at all and is applicable to not just NSW but a broad class of welfare functions, while the second one has better dependence on $K$ and is preferable when $N$ is small. Finally, we also show that logarithmic regret is possible whenever there exists one agent who is indifferent about different arms.
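To make the distinction between the two fairness measures concrete, the following is a minimal sketch, not the authors' code. It assumes a mean-reward matrix `u` of shape $K \times N$ with entries in $[0,1]$ and a distribution `p` over the $K$ arms, and evaluates both objectives on the agents' expected rewards. The function names and the toy matrix are illustrative assumptions.

```python
import numpy as np

def expected_utilities(p, u):
    """Expected reward of each agent when an arm is drawn from p.

    p: shape (K,), a probability vector over arms (assumed input).
    u: shape (K, N), u[k, n] is agent n's mean reward for arm k (assumed in [0, 1]).
    """
    return np.asarray(p, dtype=float) @ np.asarray(u, dtype=float)

def nsw(p, u):
    """Nash social welfare: geometric mean of the agents' expected rewards."""
    v = expected_utilities(p, u)
    return float(np.prod(v) ** (1.0 / v.size))

def nsw_prod(p, u):
    """Product-form objective: the same product without the 1/N root."""
    return float(np.prod(expected_utilities(p, u)))

# Toy instance (hypothetical): two agents, each valuing a different arm.
u = np.array([[1.0, 0.0],
              [0.0, 1.0]])
p_fair = np.array([0.5, 0.5])    # splits probability evenly between the arms
p_unfair = np.array([1.0, 0.0])  # always serves agent 0's preferred arm

print(nsw(p_fair, u), nsw_prod(p_fair, u))      # 0.5 0.25
print(nsw(p_unfair, u), nsw_prod(p_unfair, u))  # 0.0 0.0
```

Both objectives rank the two policies the same way here, since one is a monotone transform of the other. The $1/N$ root changes how the objective behaves near zero, which is where the differing regret rates discussed in the abstract come from.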

Paper Structure (29 sections, 18 theorems, 57 equations, 2 algorithms)

Introduction
Related Work
Preliminaries
General Notation.
Social Welfare Functions
Nash Social Welfare (NSW)
Problem Setup.
Connection to Bandit Convex Optimization.
Stochastic Environments with Bandit Feedback
Upper Bound: a Refined Analysis of UCB with a Bernstein-Type Confidence Set
Lower Bound
Adversarial Environments
Impossibility Results with Bandit Feedback
Full-Information Feedback
FTRL with Log-Barrier Regularizer
...and 14 more sections

Key Result

Theorem 3.1

With $N_0=1+18\log KT$, the UCB algorithm with a Bernstein-type confidence set guarantees $\mathbb{E}\left[\mathrm{Reg}_{\mathrm{sto}}\right]=\widetilde{\mathcal{O}}\left(K^{\frac{2}{N}}T^{\frac{N-1}{N}}+K\right)$.
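As a quick reading aid, the leading term of this bound can be specialized to a few values of $N$. This is plain arithmetic on the exponents stated in the theorem, ignoring the additive $K$ term and the logarithmic factors hidden by $\widetilde{\mathcal{O}}$.

```latex
\begin{align*}
N = 2:&\quad K^{\frac{2}{2}}\, T^{\frac{1}{2}} = K\sqrt{T}, \\
N = 3:&\quad K^{\frac{2}{3}}\, T^{\frac{2}{3}}, \\
N \to \infty:&\quad \tfrac{2}{N} \to 0 \ \text{ and } \ \tfrac{N-1}{N} \to 1,
  \ \text{ so the leading term approaches order } T.
\end{align*}
```

For $N=2$ the dependence on $T$ matches the $\sqrt{T}$ rate, while for every $N \ge 3$ the exponent $\frac{N-1}{N}$ exceeds $\frac{1}{2}$ and the guarantee weakens as the number of agents grows.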

Theorems & Definitions (31)

Theorem 3.1
Theorem 3.2
Theorem 4.1
proof
Theorem 4.2
proof : Proof Sketch
Theorem 4.3
Theorem 4.4
Theorem A.1
proof
...and 21 more
