Strategically Robust Multi-Agent Reinforcement Learning with Linear Function Approximation

Jake Gonzales; Max Horwitz; Eric Mazumdar; Lillian J. Ratliff

TL;DR

This work studies Risk-Sensitive Quantal Response Equilibrium (RQRE), which yields a unique, smooth solution under bounded rationality and risk sensitivity, and proposes RQRE-OVI, an optimistic value iteration algorithm for computing RQRE with linear function approximation in large or continuous state spaces.

Abstract

Provably efficient and robust equilibrium computation in general-sum Markov games remains a core challenge in multi-agent reinforcement learning. Nash equilibrium is computationally intractable in general and brittle due to equilibrium multiplicity and sensitivity to approximation error. We study Risk-Sensitive Quantal Response Equilibrium (RQRE), which yields a unique, smooth solution under bounded rationality and risk sensitivity. We propose RQRE-OVI, an optimistic value iteration algorithm for computing RQRE with linear function approximation in large or continuous state spaces. Through finite-sample regret analysis, we establish convergence and explicitly characterize how sample complexity scales with rationality and risk-sensitivity parameters. The regret bounds reveal a quantitative tradeoff: increasing rationality tightens regret, while risk sensitivity induces regularization that enhances stability and robustness. This exposes a Pareto frontier between expected performance and robustness, with Nash recovered in the limit of perfect rationality and risk neutrality. We further show that the RQRE policy map is Lipschitz continuous in estimated payoffs, unlike Nash, and RQRE admits a distributionally robust optimization interpretation. Empirically, we demonstrate that RQRE-OVI achieves competitive performance under self-play while producing substantially more robust behavior under cross-play compared to Nash-based approaches. These results suggest RQRE-OVI offers a principled, scalable, and tunable path for equilibrium learning with improved robustness and generalization.
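
To make the Lipschitz-continuity claim concrete, the following sketch contrasts a logit (softmax) response with a hard best response on a two-action payoff vector. It is an illustration under assumptions only: the function names, the `temperature` parameter, and the toy payoffs are hypothetical and not taken from the paper, and a plain logit response stands in for the full RQRE policy map.

```python
import numpy as np

def logit_response(payoffs, temperature):
    """Softmax over payoffs (hypothetical stand-in for a smooth response map)."""
    z = np.asarray(payoffs, dtype=float) / temperature
    z = z - z.max()              # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def best_response(payoffs):
    """Hard argmax response: all mass on the highest-payoff action."""
    p = np.zeros(len(payoffs))
    p[int(np.argmax(payoffs))] = 1.0
    return p

# Two nearly tied actions; perturb the estimated payoffs by +/- eps.
q = np.array([1.0, 1.0])
eps = 1e-3
q_plus = q + np.array([eps, -eps])    # action 0 slightly better
q_minus = q + np.array([-eps, eps])   # action 1 slightly better

# L1 distance between the policies induced by the two payoff estimates.
br_gap = np.abs(best_response(q_plus) - best_response(q_minus)).sum()
lr_gap = np.abs(logit_response(q_plus, 1.0) - logit_response(q_minus, 1.0)).sum()

print("best-response policy gap:", br_gap)
print("logit-response policy gap:", lr_gap)
```

In this toy case the best response flips entirely under a payoff change of size `eps`, while the logit response moves by an amount proportional to `eps`. That is the kind of sensitivity to approximation error the abstract attributes to Nash-based solutions.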

Paper Structure (88 sections, 17 theorems, 195 equations, 7 figures, 2 tables, 1 algorithm)


Introduction
Preliminaries
Bounded Rationality and Quantal Response Equilibrium
Risk-Sensitive Markov Games
Risk-adjusted loss in matrix games.
Risk-Averse Quantal Response Equilibrium (RQE).
Extension to Risk-Sensitive Markov games.
Risk-Sensitive QRE Optimistic Value Iteration
Regret Notion.
RQRE-OVI (Algorithm 1).
Regret Bounds in terms of Rationality and Risk Preferences
Distributional Robustness & Stability of RQRE
Distributional Robustness of RQRE
RQRE Admit Stability Properties that Yield Performance Guarantees
Numerical Experiments
...and 73 more sections

Key Result

Theorem 1

A mapping $\rho : \mathcal{Z} \to \mathbb{R}$ over a finite outcome space $\Omega$ is a convex risk measure if and only if there exists a convex, lower-semicontinuous penalty function $\varphi: \Delta(\Omega) \to (-\infty, \infty]$ such that $\rho(Z) = \sup_{p \in \Delta(\Omega)} \left\{ \mathop{\ma
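
As a numerical illustration of this kind of dual representation, the sketch below checks the well-known entropic-risk case, where the penalty is a scaled KL divergence to a reference distribution. The statement above is truncated in the source, so the sign convention here is an assumption ($Z$ is treated as a loss, with the dual objective $\mathbb{E}_p[Z] - \varphi(p)$). The names `entropic_risk`, `dual_objective`, and `tau`, and the toy numbers, are likewise hypothetical.

```python
import numpy as np

def entropic_risk(losses, probs, tau):
    """(1/tau) * log E_P[exp(tau * Z)], computed with a stable log-sum-exp."""
    a = tau * losses + np.log(probs)
    m = a.max()
    return (m + np.log(np.exp(a - m).sum())) / tau

def dual_objective(p, losses, probs, tau):
    """E_p[Z] - (1/tau) * KL(p || P): the assumed penalized dual objective."""
    mask = p > 0
    kl = np.sum(p[mask] * np.log(p[mask] / probs[mask]))
    return p @ losses - kl / tau

# Toy finite outcome space with a strictly positive reference distribution P.
losses = np.array([0.0, 1.0, 3.0])
probs = np.array([0.5, 0.3, 0.2])
tau = 2.0

# The supremum is attained by exponentially tilting P toward high losses.
p_star = probs * np.exp(tau * losses)
p_star = p_star / p_star.sum()

primal = entropic_risk(losses, probs, tau)
dual = dual_objective(p_star, losses, probs, tau)
print(primal, dual)
```

The two printed values agree up to floating-point error, and any other distribution `p` gives a dual objective no larger than `primal`. This is the distributionally robust reading: the risk measure evaluates the loss under an adversarially tilted distribution, penalized for straying from the reference.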

Figures (7)

Figure 1: Self-play team return during training. Moving average of team return over episodes for Stag-Hunt (left) and Overcooked (right). In Stag-Hunt, higher $\tau$ drives agents toward the payoff-dominant (stag, stag) outcome, while lower $\tau$ yields the more robust risk-dominant (hare, hare) outcome. In Overcooked, all RQRE variants and QRE converge to comparable team returns, with Nash variants reaching similar or slightly lower levels.
Figure 2: Cross-play retention $(R(\delta)/R(0))$ as a function of partner perturbation noise ($\delta$): Stag-Hunt (left) and Overcooked (right). At each evaluation step, the partner's action is a fixed deterministic action (e.g., always move in one direction) with probability $\delta$ and otherwise follows its trained policy. This produces high-signal deviations that emphasize the robustness phenomena. Curves are normalized by the $\delta=0$ baseline, so higher values indicate stronger robustness and lower values indicate performance degradation. Results are averaged over 200 evaluation rollouts per noise level.
Figure 3: Cross-play with unseen partners in Overcooked. Each point represents the rewards of two agents trained under different algorithms and paired at test time without ever having seen each other. The left panel shows the RQRE-OVI agent's reward (vertical axis) versus the NQ-OVI agent's reward (horizontal axis) for each pairing; the right panel reverses the roles. Points above the diagonal indicate that the agent on the vertical axis (in this case RQRE) achieves higher return than its partner. Labels denote the $\tau$ value of the corresponding RQRE-OVI agent, and the red diamond marks the QRE-OVI baseline score against NQ-OVI. Across all pairings, RQRE-OVI agents achieve equal or higher ego reward than their cross-play partner, with moderate $\tau$ values (e.g., $\tau=0.01$) yielding the strongest advantage.
Figure 4: Dynamic Stag Hunt (left) and Overcooked (right) environments used in experiments.
Figure 5: Stag Hunt outcome distributions during training. Each panel shows the fraction of stag--stag, hare--hare, and mixed interaction outcomes (rolling average) for a given algorithm and risk-aversion level. NQ-OVI, QRE, and low-risk-aversion RQRE agents converge to the payoff-dominant stag--stag outcome, while highly risk-averse agents converge to the risk-dominant hare--hare outcome, confirming the expected equilibrium selection.
...and 2 more figures
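
The partner-perturbation protocol described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under assumptions, not the authors' evaluation code: the function names, the integer action encoding, and the seed are hypothetical.

```python
import numpy as np

def perturbed_partner_action(policy_action, fixed_action, delta, rng):
    """With probability delta play the fixed action, else the trained policy's action."""
    return fixed_action if rng.random() < delta else policy_action

def retention(mean_return_perturbed, mean_return_clean):
    """Cross-play retention R(delta) / R(0), normalized by the delta = 0 baseline."""
    return mean_return_perturbed / mean_return_clean

rng = np.random.default_rng(0)

# Toy check: action 1 is the trained policy's choice, action 0 the fixed deviation.
delta = 0.25
actions = [perturbed_partner_action(1, 0, delta, rng) for _ in range(10_000)]
fraction_fixed = actions.count(0) / len(actions)
print("empirical deviation rate:", fraction_fixed)
```

In a full evaluation, one would average episode returns over rollouts at each `delta` (the caption reports 200 rollouts per noise level) and pass those averages to `retention`.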

Theorems & Definitions (41)

Definition 1: Markov Nash Equilibrium
Definition 2: Quantal Response Equilibrium (McKelvey and Palfrey, 1995)
Example 1: Logit Quantal Response Equilibrium
Definition 3: Convex Risk Measure
Theorem 1: Dual Representation of Convex Risk Measures
Example 2: Entropic Risk
Definition 4: mazumdar2025tractable
Definition 5: Risk Quantal Response Equilibrium for Markov Games
Theorem 2: Regret bound
Corollary 1: Regret under entropic policy-risk and regularization
...and 31 more
