Provably Convergent Actor-Critic in Risk-averse MARL

Yizhou Zhang, Eric Mazumdar

TL;DR

This work demonstrates that RQE possesses strong regularity conditions that make it uniquely amenable to learning in MGs, and proposes a novel two-timescale Actor-Critic algorithm characterized by a fast-timescale actor and a slow-timescale critic that achieves global convergence with finite-sample guarantees.

Abstract

Learning stationary policies in infinite-horizon general-sum Markov games (MGs) remains a fundamental open problem in Multi-Agent Reinforcement Learning (MARL). While stationary strategies are preferred for their practicality, computing stationary forms of classic game-theoretic equilibria is computationally intractable -- a stark contrast to the comparative ease of solving single-agent RL or zero-sum games. To bridge this gap, we study Risk-averse Quantal response Equilibria (RQE), a solution concept rooted in behavioral game theory that incorporates risk aversion and bounded rationality. We demonstrate that RQE possesses strong regularity conditions that make it uniquely amenable to learning in MGs. We propose a novel two-timescale Actor-Critic algorithm characterized by a fast-timescale actor and a slow-timescale critic. Leveraging the regularity of RQE, we prove that this approach achieves global convergence with finite-sample guarantees. We empirically validate our algorithm in several environments to demonstrate superior convergence properties compared to risk-neutral baselines.
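To make the two-timescale idea concrete, here is a minimal sketch of a fast actor tracking a slowly updated critic. It is not the paper's algorithm. It assumes a stateless two-player matrix game, exact expected payoffs instead of samples, and entropy-regularized (logit quantal response) policies with no risk-aversion term. The function name `two_timescale_qre`, the payoff matrices, and all step sizes are hypothetical choices for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def two_timescale_qre(R1, R2, tau=2.0, actor_lr=0.5, critic_lr=0.05, steps=5000):
    """Toy two-timescale iteration on a matrix game (illustrative assumption only).

    R1[a1, a2] and R2[a1, a2] are the payoffs of players 1 and 2.
    The critic (q1, q2) moves with the small step critic_lr; the actor
    logits (theta1, theta2) move with the larger step actor_lr.
    """
    n1, n2 = R1.shape
    theta1, theta2 = np.zeros(n1), np.zeros(n2)
    q1, q2 = np.zeros(n1), np.zeros(n2)
    for _ in range(steps):
        pi1, pi2 = softmax(theta1), softmax(theta2)
        # Slow critic: move action-value estimates toward the expected
        # payoff of each action against the opponent's current policy.
        q1 += critic_lr * (R1 @ pi2 - q1)
        q2 += critic_lr * (R2.T @ pi1 - q2)
        # Fast actor: move logits toward q / tau, so the policy tracks
        # the logit quantal response softmax(q / tau).
        theta1 += actor_lr * (q1 / tau - theta1)
        theta2 += actor_lr * (q2 / tau - theta2)
    return softmax(theta1), softmax(theta2)

# Hypothetical general-sum coordination game.
R1 = np.array([[1.0, 0.0], [0.0, 2.0]])
R2 = np.array([[2.0, 0.0], [0.0, 1.0]])
tau = 2.0
pi1, pi2 = two_timescale_qre(R1, R2, tau=tau)

# Distance from the fixed point pi_i = softmax(R_i pi_{-i} / tau).
residual = max(
    np.abs(pi1 - softmax(R1 @ pi2 / tau)).max(),
    np.abs(pi2 - softmax(R2.T @ pi1 / tau)).max(),
)
```

At a fixed point of this iteration, each policy is a logit quantal response to the other, so `residual` measures how far the iterates are from a risk-neutral quantal response equilibrium of the toy game. The temperature `tau` is chosen large enough here that the response map is a contraction. The paper's setting additionally involves risk-averse objectives, Markov game dynamics, and sampled updates, none of which are modeled in this sketch.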

Paper Structure

This paper contains 58 sections, 25 theorems, 267 equations, 9 figures, 4 tables, 2 algorithms.

Key Result

Proposition 1

Let $z^*=(\pi^*,p^*)$ be a Nash equilibrium of the 4-player game characterized by eq:intro_4player_objective and eq:intro_4player_adv_objective. We have that $\pi^*$ is an RQE of the original two-player game characterized by eq:intro_risk_averse_regularized_objective. Furthermore, if $\pi^*$ is an R

Figures (9)

  • Figure 1: RQE uniqueness region for KL/log-barrier or reverse KL/negative entropy. Green captures the region indicated by Theorems \ref{thm:RQE_property_4player_monotonicity} and \ref{thm:monotonicity_condition}. Orange captures the region in zhang2025convergent, and blue captures the region in mazumdar2024tractableequilibriumcomputationmarkov.
  • Figure 2: GD dynamics for different $\tau$ with KL and log-barrier risks, $\epsilon_i=0.2$ and $\tau_1=\tau_2$.
  • Figure 3: MA100 agent 0 reward curves of gridworld cooperation game for 10 risk-averse and 10 risk-neutral training runs.
  • Figure 4: MA100 reward curves of Simple Tag with the good agents fixed, for 5 risk-averse and 5 risk-neutral training runs.
  • Figure 5: Gridworld layout. Agent 0 and agent 1 are shown as blue and red dots in the upper-left corner. The defection zones are painted in blue (for agent 0) and red (for agent 1).
  • ...and 4 more figures

Theorems & Definitions (47)

  • Definition 1: mazumdar2024tractableequilibriumcomputationmarkov, Definition 5
  • Proposition 1: mazumdar2024tractableequilibriumcomputationmarkov, Proposition 1
  • Definition 2
  • Definition 3: Stationary Markov RQE
  • Definition 4
  • Proposition 2
  • Theorem 3.1
  • Theorem 3.2
  • Proposition 3
  • Theorem 4.1
  • ...and 37 more