Strategically Robust Multi-Agent Reinforcement Learning with Linear Function Approximation

Jake Gonzales, Max Horwitz, Eric Mazumdar, Lillian J. Ratliff

TL;DR

This work studies Risk-Sensitive Quantal Response Equilibrium (RQRE), which yields a unique, smooth solution under bounded rationality and risk sensitivity, and proposes RQRE-OVI, an optimistic value iteration algorithm for computing RQRE with linear function approximation in large or continuous state spaces.

Abstract

Provably efficient and robust equilibrium computation in general-sum Markov games remains a core challenge in multi-agent reinforcement learning. Nash equilibrium is computationally intractable in general and brittle due to equilibrium multiplicity and sensitivity to approximation error. We study Risk-Sensitive Quantal Response Equilibrium (RQRE), which yields a unique, smooth solution under bounded rationality and risk sensitivity. We propose RQRE-OVI, an optimistic value iteration algorithm for computing RQRE with linear function approximation in large or continuous state spaces. Through finite-sample regret analysis, we establish convergence and explicitly characterize how sample complexity scales with rationality and risk-sensitivity parameters. The regret bounds reveal a quantitative tradeoff: increasing rationality tightens regret, while risk sensitivity induces regularization that enhances stability and robustness. This exposes a Pareto frontier between expected performance and robustness, with Nash recovered in the limit of perfect rationality and risk neutrality. We further show that the RQRE policy map is Lipschitz continuous in estimated payoffs, unlike Nash, and RQRE admits a distributionally robust optimization interpretation. Empirically, we demonstrate that RQRE-OVI achieves competitive performance under self-play while producing substantially more robust behavior under cross-play compared to Nash-based approaches. These results suggest RQRE-OVI offers a principled, scalable, and tunable path for equilibrium learning with improved robustness and generalization.

Paper Structure

This paper contains 88 sections, 17 theorems, 195 equations, 7 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

A mapping $\rho : \mathcal{Z} \to \mathbb{R}$ over a finite outcome space $\Omega$ is a convex risk measure if and only if there exists a convex, lower-semicontinuous penalty function $\varphi: \Delta(\Omega) \to (-\infty, \infty]$ such that $\rho(Z) = \sup_{p \in \Delta(\Omega)} \left\{ \dots \right\}$ ...
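As a numerical illustration of this kind of dual representation, the sketch below uses entropic risk (listed as Example 2 in the paper). It assumes the cost-style convention $\rho(Z) = \frac{1}{\theta}\log \mathbb{E}_P[e^{\theta Z}]$ with penalty $\varphi(q) = \frac{1}{\theta}\mathrm{KL}(q \,\|\, P)$, for which the supremum is attained at $q^* \propto P\, e^{\theta Z}$. The paper's own sign convention and penalty may differ, and the function names here are hypothetical.

```python
import numpy as np

def entropic_risk(z, p, theta):
    """(1/theta) * log E_p[exp(theta * z)], computed with a max-shift for stability.

    Assumed convention: z is a cost and theta > 0 is the risk-sensitivity parameter.
    """
    a = theta * z
    m = a.max()
    return float((m + np.log(np.sum(p * np.exp(a - m)))) / theta)

def dual_objective(q, z, p, theta):
    """E_q[z] - (1/theta) * KL(q || p): the quantity maximized over q in the dual form."""
    mask = q > 0  # terms with q_i = 0 contribute nothing to the KL divergence
    kl = np.sum(q[mask] * np.log(q[mask] / p[mask]))
    return float(q @ z - kl / theta)

def gibbs_maximizer(z, p, theta):
    """Exponentially tilted distribution q* proportional to p * exp(theta * z)."""
    a = theta * z
    w = p * np.exp(a - a.max())
    return w / w.sum()

# Small example on a three-outcome space (values chosen arbitrarily).
z = np.array([1.0, 0.0, -2.0])
p = np.array([0.2, 0.5, 0.3])
theta = 0.7
primal = entropic_risk(z, p, theta)
dual = dual_objective(gibbs_maximizer(z, p, theta), z, p, theta)
```

Under these assumptions `primal` and `dual` agree up to floating-point error, and any other distribution `q` on the same outcomes gives a dual objective no larger than `primal`.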

Figures (7)

  • Figure 1: Self-play team return during training. Moving average of team return over episodes for Stag-Hunt (left) and Overcooked (right). In Stag-Hunt, higher $\tau$ drives agents toward the payoff-dominant (stag, stag) outcome, while lower $\tau$ yields the more robust risk-dominant (hare, hare) outcome. In Overcooked, all RQRE variants and QRE converge to comparable team returns, with Nash variants reaching similar or slightly lower levels.
  • Figure 2: Cross-play retention $(R(\delta)/R(0))$ as a function of partner perturbation noise ($\delta$): Stag-Hunt (left) and Overcooked (right). At each evaluation step, the partner's action is a fixed deterministic action (e.g., always move in one direction) with probability $\delta$ and otherwise follows its trained policy. This produces high-signal deviations that emphasize the robustness phenomena. Curves are normalized by the $\delta=0$ baseline, so higher values indicate stronger robustness and lower values indicate performance degradation. Results are averaged over 200 evaluation rollouts per noise level.
  • Figure 3: Cross-play with unseen partners in Overcooked. Each point represents the rewards of two agents trained under different algorithms and paired at test time without ever having seen each other before. The left panel shows the RQRE-OVI agent's reward (vertical axis) versus the NQ-OVI agent's reward (horizontal axis) for each pairing; the right panel reverses the roles. Points above the diagonal indicate that the agent on the vertical axis (in this case RQRE) achieves higher return than its partner. Labels denote the $\tau$ value of the corresponding RQRE-OVI agent, and the red diamond marks the QRE-OVI baseline score against NQ-OVI. Across all pairings, RQRE-OVI agents achieve equal or higher ego reward than their cross-play partner, with moderate $\tau$ values (e.g., $\tau=0.01$) yielding the strongest advantage.
  • Figure 4: Dynamic Stag Hunt (left) and Overcooked (right) environments used in experiments.
  • Figure 5: Stag Hunt outcome distributions during training. Each panel shows the fraction of stag--stag, hare--hare, and mixed interaction outcomes (rolling average) for a given algorithm and risk-aversion level. NQ-OVI, QRE, and RQRE agents with low risk aversion converge to payoff-dominant stag--stag outcomes, while highly risk-averse agents converge to risk-dominant hare--hare outcomes, confirming the expected equilibrium selection.
  • ...and 2 more figures
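
The perturbation and normalization described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names, the action encoding, and the dictionary layout of returns are all hypothetical.

```python
import random
from statistics import mean

def perturbed_partner_action(policy_action, fixed_action, delta, rng):
    """With probability delta, override the partner's trained-policy action
    with a fixed deterministic action; otherwise keep the policy's action."""
    return fixed_action if rng.random() < delta else policy_action

def retention(returns_by_delta):
    """Normalize the mean return at each noise level by the delta = 0 baseline,
    i.e. compute R(delta) / R(0).

    returns_by_delta: dict mapping a noise level to a list of rollout returns
    (hypothetical layout; the caption reports 200 rollouts per level).
    """
    baseline = mean(returns_by_delta[0.0])
    if baseline == 0:
        raise ValueError("baseline return R(0) is zero; retention is undefined")
    return {d: mean(r) / baseline for d, r in returns_by_delta.items()}

# Usage with made-up returns, purely to show the shape of the computation.
rng = random.Random(0)
action = perturbed_partner_action("policy_move", "always_left", delta=0.3, rng=rng)
curve = retention({0.0: [10.0, 10.0], 0.5: [5.0, 5.0]})
```

A retention value near 1 means the ego agent keeps most of its unperturbed return at that noise level, which is how the curves in Figure 2 are read.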

Theorems & Definitions (41)

  • Definition 1: Markov Nash Equilibrium
  • Definition 2: Quantal Response Equilibrium (McKelvey and Palfrey, 1995)
  • Example 1: Logit Quantal Response Equilibrium
  • Definition 3: Convex Risk Measure
  • Theorem 1: Dual Representation of Convex Risk Measures
  • Example 2: Entropic Risk
  • Definition 4: mazumdar2025tractable
  • Definition 5: Risk Quantal Response Equilibrium for Markov Games
  • Theorem 2: Regret bound
  • Corollary 1: Regret under entropic policy-risk and regularization
  • ...and 31 more