
No-Regret Learning for Fair Multi-Agent Social Welfare Optimization

Mengxiao Zhang, Ramiro Deo-Campo Vuong, Haipeng Luo

TL;DR

This work investigates online learning for multi-agent Nash social welfare (NSW) maximization, using NSW as the fairness-aware objective and exploring both stochastic and adversarial environments under bandit and full-information feedback. It develops a Bernstein-based UCB algorithm for the stochastic NSW bandit setting, achieving a near-optimal $\tilde{O}(K^{2/N} T^{(N-1)/N})$ regret and proving a lower bound showing that the dependence on $T$ is tight, in stark contrast with the $\tilde{O}(\sqrt{T})$ regret attainable for NSW_prod (the product of the agents' rewards). In adversarial settings with bandit feedback, the authors prove that sublinear regret is impossible, while in the full-information setting they design two Follow-the-Regularized-Leader (FTRL) based algorithms, one with a log-barrier regularizer (whose regret is independent of $N$) and another with Tsallis entropy, that both attain $\tilde{O}(\sqrt{T})$-type guarantees, with different trade-offs in $K$ and $N$. The paper also shows a special case where logarithmic regret is possible under exp-concavity when some agents are indifferent on certain rounds. Overall, the results reveal fundamental differences between NSW and NSW_prod and show that richer feedback dramatically improves learnability, with implications for fair online resource allocation in dynamic, multi-agent settings.

Abstract

We consider the problem of online multi-agent Nash social welfare (NSW) maximization. While previous works of Hossain et al. [2021], Jones et al. [2023] study similar problems in stochastic multi-agent multi-armed bandits and show that $\sqrt{T}$-regret is possible after $T$ rounds, their fairness measure is the product of all agents' rewards, instead of their NSW (that is, their geometric mean). Given the fundamental role of NSW in the fairness literature, it is more than natural to ask whether no-regret fair learning with NSW as the objective is possible. In this work, we provide a complete answer to this question in various settings. Specifically, in stochastic $N$-agent $K$-armed bandits, we develop an algorithm with $\widetilde{\mathcal{O}}\left(K^{\frac{2}{N}}T^{\frac{N-1}{N}}\right)$ regret and prove that the dependence on $T$ is tight, making it a sharp contrast to the $\sqrt{T}$-regret bounds of Hossain et al. [2021], Jones et al. [2023]. We then consider a more challenging version of the problem with adversarial rewards. Somewhat surprisingly, despite NSW being a concave function, we prove that no algorithm can achieve sublinear regret. To circumvent such negative results, we further consider a setting with full-information feedback and design two algorithms with $\sqrt{T}$-regret: the first one has no dependence on $N$ at all and is applicable to not just NSW but a broad class of welfare functions, while the second one has better dependence on $K$ and is preferable when $N$ is small. Finally, we also show that logarithmic regret is possible whenever there exists one agent who is indifferent about different arms.
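To make the distinction between the two objectives concrete, here is a minimal Python sketch. It assumes, as an illustration rather than the paper's exact formalization, that the learner plays a distribution `p` over the $K$ arms, that `u[k][n]` is the reward of arm `k` for agent `n`, and that each objective is applied to the agents' expected rewards under `p`. The function names `expected_rewards`, `nsw`, and `nsw_prod` are hypothetical.

```python
import math

def expected_rewards(p, u):
    """Expected reward of each agent when an arm is drawn from distribution p.

    p: list of K probabilities; u: K x N nested list of rewards (assumed layout).
    """
    n_agents = len(u[0])
    return [sum(p[k] * u[k][n] for k in range(len(p))) for n in range(n_agents)]

def nsw_prod(p, u):
    """Product of the agents' expected rewards (the objective of prior work)."""
    return math.prod(expected_rewards(p, u))

def nsw(p, u):
    """Geometric mean of the agents' expected rewards (the NSW objective)."""
    rewards = expected_rewards(p, u)
    return math.prod(rewards) ** (1.0 / len(rewards))

# Two arms, two agents: each agent only benefits from "its own" arm.
u = [[1.0, 0.0],
     [0.0, 1.0]]

fair = [0.5, 0.5]    # split evenly between the arms
unfair = [1.0, 0.0]  # always serve agent 0

print(nsw(fair, u), nsw_prod(fair, u))  # 0.5 0.25
print(nsw(unfair, u))                   # 0.0
```

Both objectives rank the fair allocation above the unfair one; they differ only by the $1/N$ exponent, and the abstract's point is that this exponent changes the achievable regret rates.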

Paper Structure

This paper contains 29 sections, 18 theorems, 57 equations, and 2 algorithms.

Key Result

Theorem 3.1

With $N_0 = 1 + 18\log(KT)$, the Bernstein-based UCB algorithm guarantees $\mathbb{E}\left[\mathrm{Reg}_{\mathrm{sto}}\right]=\widetilde{\mathcal{O}}\left(K^{\frac{2}{N}}T^{\frac{N-1}{N}}+K\right)$.
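As a quick sanity check on how this rate scales with the number of agents, the leading term can be specialized to a few values of $N$. This is plain arithmetic on the exponents, ignoring the additive $K$ and the logarithmic factors hidden by $\widetilde{\mathcal{O}}$:

```latex
\begin{align*}
N = 2: &\quad K^{\frac{2}{2}}\, T^{\frac{1}{2}} = K\sqrt{T}, \\
N = 3: &\quad K^{\frac{2}{3}}\, T^{\frac{2}{3}}, \\
N = 4: &\quad K^{\frac{1}{2}}\, T^{\frac{3}{4}}, \\
N \to \infty: &\quad \tfrac{2}{N} \to 0 \ \text{ and } \ \tfrac{N-1}{N} \to 1 .
\end{align*}
```

So the bound matches a $\sqrt{T}$ rate only for $N=2$, and its dependence on $T$ approaches linear as $N$ grows while the dependence on $K$ vanishes.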

Theorems & Definitions (31)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 4.1
  • proof
  • Theorem 4.2
  • proof : Proof Sketch
  • Theorem 4.3
  • Theorem 4.4
  • Theorem A.1
  • proof
  • ...and 21 more