Taming Equilibrium Bias in Risk-Sensitive Multi-Agent Reinforcement Learning

Yingjie Fei, Ruitu Xu

TL;DR

This work addresses risk-sensitive multi-agent reinforcement learning (MARL) in general-sum Markov games where agents optimize the entropic risk measure $V_m = \frac{1}{\beta_m}\log \mathbb{E}[e^{\beta_m R_m}]$ and may have heterogeneous risk preferences. It shows that naive regret definitions induce equilibrium bias toward the most risk-sensitive agents, and proposes risk-balanced regret to symmetrize performance across agents, along with a lower-bound analysis. A self-play algorithm, MARS-VI, combines risk-sensitive value iteration with optimistic exploration and an equilibrium solver to learn Nash equilibria (NE), correlated equilibria (CE), and coarse correlated equilibria (CCE), achieving near-optimal guarantees with respect to risk-balanced regret. The results recover classical risk-neutral and single-agent regimes as special cases and provide the first finite-sample guarantees in risk-sensitive MARL, with practical implications for balanced policy design in finance and competitive environments.
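To make the entropic risk measure above concrete, here is a minimal Python sketch that evaluates $\frac{1}{\beta}\log \mathbb{E}[e^{\beta R}]$ for a discrete reward distribution. The function name `entropic_risk`, its arguments, and the toy rewards are illustrative assumptions, not part of the paper or of MARS-VI; the max-shift inside the log-sum-exp is a standard numerical-stability trick rather than anything the source prescribes.

```python
import math


def entropic_risk(rewards, beta, probs=None):
    """Entropic risk (1/beta) * log E[exp(beta * R)] of a discrete reward R.

    Hypothetical helper for illustration only. `rewards` lists the outcomes,
    `probs` their probabilities (uniform if omitted), and `beta` is the risk
    parameter. beta == 0 is treated as the risk-neutral limit E[R].
    """
    n = len(rewards)
    if probs is None:
        probs = [1.0 / n] * n
    if beta == 0.0:
        # Risk-neutral limit: plain expectation.
        return sum(p * r for p, r in zip(probs, rewards))
    # Shift exponents by their maximum so math.exp cannot overflow.
    exponents = [beta * r for r in rewards]
    shift = max(exponents)
    total = sum(p * math.exp(e - shift) for p, e in zip(probs, exponents))
    return (shift + math.log(total)) / beta


if __name__ == "__main__":
    coin = [0.0, 1.0]  # toy reward: 0 or 1 with equal probability
    for beta in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"beta={beta:+.1f}  value={entropic_risk(coin, beta):.4f}")
```

On this toy reward, positive `beta` yields a value above the mean 0.5 and negative `beta` a value below it, while a deterministic reward gives the same value for every `beta`.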

Abstract

We study risk-sensitive multi-agent reinforcement learning under general-sum Markov games, where agents optimize the entropic risk measure of rewards with possibly diverse risk preferences. We show that using the regret naively adapted from existing literature as a performance metric could induce policies with equilibrium bias that favor the most risk-sensitive agents and overlook the other agents. To address such deficiency of the naive regret, we propose a novel notion of regret, which we call risk-balanced regret, and show through a lower bound that it overcomes the issue of equilibrium bias. Furthermore, we develop a self-play algorithm for learning Nash, correlated, and coarse correlated equilibria in risk-sensitive Markov games. We prove that the proposed algorithm attains near-optimal regret guarantees with respect to the risk-balanced regret.
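
As a worked illustration of what "diverse risk preferences" means under the entropic risk measure, the standard cumulant expansion for small risk parameter is shown below. This is a textbook identity, not a result quoted from the paper; $R$ denotes a generic bounded return and $\beta$ a generic risk parameter.

$$
\frac{1}{\beta}\log \mathbb{E}\left[e^{\beta R}\right] = \mathbb{E}[R] + \frac{\beta}{2}\,\mathrm{Var}(R) + O(\beta^2) \qquad (\beta \to 0).
$$

Under this expansion, $\beta > 0$ rewards variance (risk-seeking), $\beta < 0$ penalizes it (risk-averse), and $\beta \to 0$ recovers the risk-neutral expected return.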

Paper Structure

This paper contains 24 sections, 10 theorems, 107 equations, and 3 algorithms.

Key Result

Theorem 4.1

For $H \geq 8$, $K \geq \max\{16e^{|\beta_*|(H-1)}, 16H\}$, and $\log\log K \gtrsim |\beta_*|(H-1)$, there exists a Markov game (MG) such that any algorithm obeys [bound omitted in source]. The same bound holds for $\mathbb{E}[\overline{\mathrm{Regret}}_{\mathsf{CE}}(K)]$ and $\mathbb{E}[\overline{\mathrm{Regret}}_{\mathsf{CCE}}(K)]$.

Theorems & Definitions (19)

  • Theorem 4.1
  • Definition 4.2
  • Definition 4.3
  • Theorem 4.4
  • Theorem 6.1
  • Lemma A.1
  • proof
  • Lemma B.1
  • proof
  • Lemma B.2
  • ...and 9 more