Taming Equilibrium Bias in Risk-Sensitive Multi-Agent Reinforcement Learning
Yingjie Fei, Ruitu Xu
TL;DR
This work addresses risk-sensitive multi-agent reinforcement learning in general-sum Markov games, where each agent $m$ optimizes the entropic risk measure $V_m = \frac{1}{\beta_m}\log \mathbb{E}[e^{\beta_m R_m}]$ of its reward $R_m$ under its own risk parameter $\beta_m$, so that risk preferences may differ across agents. It shows that naively adapted regret definitions induce equilibrium bias toward the most risk-sensitive agents, and it proposes risk-balanced regret to symmetrize performance across agents, supported by a lower-bound analysis. A self-play algorithm, MARS-VI, combines risk-sensitive value iteration with optimistic exploration and an equilibrium solver to learn Nash equilibria (NE), correlated equilibria (CE), and coarse correlated equilibria (CCE), achieving near-optimal guarantees with respect to risk-balanced regret. The results recover the classical risk-neutral and single-agent regimes as special cases and provide the first finite-sample guarantees in risk-sensitive MARL, with practical implications for balanced policy design in finance and competitive environments.
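To make the entropic risk measure above concrete, the following minimal sketch evaluates $\frac{1}{\beta}\log \mathbb{E}[e^{\beta R}]$ for a discrete reward distribution. It is an illustration only, not part of MARS-VI: the function name `entropic_risk` and the two-outcome example are assumptions introduced here, and the $\beta = 0$ case is handled separately as the risk-neutral limit $\mathbb{E}[R]$.

```python
import math


def entropic_risk(rewards, probs, beta):
    """Return (1/beta) * log E[exp(beta * R)] for a discrete distribution.

    `rewards` and `probs` are equal-length sequences (hypothetical inputs
    chosen for illustration); beta = 0 is treated as the risk-neutral mean.
    """
    if beta == 0.0:
        return sum(p * r for p, r in zip(probs, rewards))
    # Shift by the largest exponent so that exp() cannot overflow.
    shift = max(beta * r for r in rewards)
    total = sum(p * math.exp(beta * r - shift) for p, r in zip(probs, rewards))
    return (shift + math.log(total)) / beta


# A fair coin paying 0 or 1, evaluated under three risk parameters.
rewards = [0.0, 1.0]
probs = [0.5, 0.5]
risk_seeking = entropic_risk(rewards, probs, 2.0)   # beta > 0
risk_neutral = entropic_risk(rewards, probs, 0.0)   # plain expectation
risk_averse = entropic_risk(rewards, probs, -2.0)   # beta < 0
print(risk_averse, risk_neutral, risk_seeking)
```

By Jensen's inequality, a positive $\beta$ yields a value above the mean and a negative $\beta$ a value below it, so agents with different $\beta_m$ assign different values to the same reward distribution.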
Abstract
We study risk-sensitive multi-agent reinforcement learning under general-sum Markov games, where agents optimize the entropic risk measure of rewards with possibly diverse risk preferences. We show that using the regret naively adapted from existing literature as a performance metric can induce policies with equilibrium bias that favor the most risk-sensitive agents and overlook the other agents. To address this deficiency of the naive regret, we propose a novel notion of regret, which we call risk-balanced regret, and show through a lower bound that it overcomes the issue of equilibrium bias. Furthermore, we develop a self-play algorithm for learning Nash, correlated, and coarse correlated equilibria in risk-sensitive Markov games. We prove that the proposed algorithm attains near-optimal regret guarantees with respect to the risk-balanced regret.
