Training Generalizable Collaborative Agents via Strategic Risk Aversion

Chengrui Qu; Yizhou Zhang; Nicholas Lanzetti; Eric Mazumdar

TL;DR

This work develops a multi-agent reinforcement learning (MARL) algorithm that integrates strategic risk aversion into standard policy optimization methods and consistently achieves reliable collaboration with heterogeneous and previously unseen partners.

Abstract

Many emerging agentic paradigms require agents to collaborate with one another (or with people) to achieve shared goals. Unfortunately, existing approaches to learning policies for such collaborative problems produce brittle solutions that fail when paired with new partners. We attribute these failures to a combination of free-riding during training and a lack of strategic robustness. To address these problems, we study the concept of strategic risk aversion and interpret it as a principled inductive bias for generalizable cooperation with unseen partners. While strategically risk-averse players are robust to deviations in their partner's behavior by design, we show that, in collaborative games, they also (1) can reach better equilibrium outcomes than those of classical game-theoretic concepts like Nash, and (2) exhibit less or no free-riding. Inspired by these insights, we develop a multi-agent reinforcement learning (MARL) algorithm that integrates strategic risk aversion into standard policy optimization methods. Our empirical results across collaborative benchmarks (including an LLM collaboration task) validate our theory and demonstrate that our approach consistently achieves reliable collaboration with heterogeneous and previously unseen partners.
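The robustness claim above can be illustrated numerically. The sketch below is not the paper's algorithm: it assumes the entropic risk measure as a stand-in for strategic risk aversion, and the function name, the payoff numbers, and the partner distribution are all hypothetical. It shows how a risk-averse evaluation with parameter `tau` discounts an action whose payoff depends heavily on what the partner does.

```python
import numpy as np

def entropic_risk_value(payoffs, partner_probs, tau):
    """Risk-adjusted value of one action under uncertainty about the partner.

    Assumption: uses the entropic risk measure
        -(1/tau) * log E[exp(-tau * payoff)],
    which reduces to the plain expectation as tau -> 0.
    """
    payoffs = np.asarray(payoffs, dtype=float)
    partner_probs = np.asarray(partner_probs, dtype=float)
    if tau == 0:
        return float(partner_probs @ payoffs)
    # Shift by the maximum exponent for numerical stability (log-sum-exp).
    z = -tau * payoffs
    m = z.max()
    return float(-(m + np.log(partner_probs @ np.exp(z - m))) / tau)

# Hypothetical shared payoffs, indexed by the partner's action (works, shirks).
rely_on_partner = [10.0, 0.0]   # pays off only if the partner works
contribute = [6.0, 6.0]         # pays the same regardless of the partner
partner_probs = [0.8, 0.2]      # assumed belief about the partner

for tau in (0.0, 1.0):
    v_rely = entropic_risk_value(rely_on_partner, partner_probs, tau)
    v_contrib = entropic_risk_value(contribute, partner_probs, tau)
    print(f"tau={tau}: rely={v_rely:.3f}, contribute={v_contrib:.3f}")
```

In this toy setting a risk-neutral evaluation (`tau=0`) prefers relying on the partner, while a risk-averse one (`tau=1`) prefers the action whose payoff does not depend on the partner.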

Paper Structure (57 sections, 14 theorems, 95 equations, 14 figures, 2 tables, 1 algorithm)

Introduction
Related Works
Problem Setup
Risk aversion
Bounded rationality
Risk-averse quantal response equilibrium (RQE)
"Free-Lunch" Theorems for Strategic Risk Aversion in Collaborative Games
Strategic Risk Aversion can Induce Collaboration
Strategic Risk Aversion Alleviates Free-riding
MARL Algorithm Design
Meta-algorithm for Strategically Risk-averse Policy Optimization
SRPO
Experiments
Training and evaluation
Ablation studies
...and 42 more sections

Key Result

Theorem 1

Let $x_i^\star(\tau)$ be the Gaussian mixed strategy of player $i$ at the unique Gaussian RQE of the game, as a function of the degree of risk aversion $\tau$. Then, the expected shared reward $\tau\mapsto J(\tau)$ is strictly increasing. That is, players contribute more to the shared reward as they

Figures (14)

Figure 1: (i) Expected utility of each player at equilibrium as a function of the degree of risk aversion $\tau$, when $\epsilon=1$ in Example \ref{example:continuous_game}. Players' utilities can first increase with risk aversion before decreasing due to over-conservatism, meaning that strategic risk aversion can yield better-performing equilibria than Nash or QRE. (ii) Probability that a player collaborates at an RQE as a function of the level of risk aversion, with $\epsilon=0.2$, for the game in Example \ref{example:free_riding}. Strategic risk aversion eliminates free-riding entirely beyond a given threshold (i.e., $\delta \rightarrow 0$), as our theory predicts.
Figure 2: Cross-play and ablation experiments in the Overcooked environment. Each square represents the average reward across 10 episodes of length 128 for each pair of agents. Diagonal blocks represent the training performance of the agents. (i) We directly observe that IPPO ($\epsilon=0.1$) learns to free-ride while SRPO ($\tau=10, \epsilon=0.1$) does not. Furthermore, mirroring Theorem \ref{thm:rqe_cooperation}, we observe that SRPO yields higher-utility strategies (i.e., risk aversion improves performance). (ii) Results of an ablation experiment, varying $\tau$ while holding $\epsilon=0.1$. We empirically observe that free-riding completely disappears as risk aversion increases, mirroring the result in Theorem \ref{thm:free_riding}. (iii) Difference between Training Performance (TP) and Cross-play Performance (CP) (mean and standard deviation): the performance of IPPO decreases drastically, with lower average and larger standard deviation in cross-play, while the performance of SRPO is unaffected.
Figure 3: Cross-play performance of SRPO ($\tau=10, \epsilon=0.01$) and IPPO ($\epsilon=0.01$) agents in the Tag environment against a runner seen during training (i) and an unseen runner (ii). Each square represents the average reward of two agents across 100 runs of length 100. IPPO does well in training environments (yet still clearly learns free-riding-like policies), but its performance degrades drastically against an unseen runner. SRPO has slightly lower training performance but clearly learns a more generalizable policy. (iii) Difference between Training Performance (TP) and Cross-play Performance (CP) (mean and standard deviation): the performance of IPPO decreases drastically, with lower average and larger standard deviation in cross-play, while the performance of SRPO is almost unaffected.
Figure 4: Cross-play performance of SRPO and IPPO agents in the Hanabi environment. We use policy sharing to validate the scalability of SRPO. During evaluation, we let agents 1 and 2 share a policy and agents 3 and 4 share a policy, enabling pairwise cross-play evaluation. In \ref{fig:hanabi-4}, each square represents the average reward of the two agent groups across 100 runs, each of length 100. \ref{fig:hanabi-drop} shows the differences between training performance (TP) and cross-play performance (CP) (mean and standard deviation) for both IPPO and SRPO. SRPO remains more robust when paired with an unseen partner. Here, we set the entropy coefficient to $\epsilon=0.001$ for both IPPO and SRPO, and $\tau=0.01$ for SRPO.
Figure 5: The Overcooked Gridworld environment.
...and 9 more figures
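The captions above report a gap between training performance (TP, the diagonal of a cross-play grid) and cross-play performance (CP, the off-diagonal entries), each as a mean and standard deviation. A minimal sketch of that bookkeeping follows, assuming a square matrix whose entry `[i, j]` is the average reward when the agent from training run `i` is paired with the partner from training run `j`. The function name and the numbers are illustrative and are not taken from the paper.

```python
import numpy as np

def tp_cp_summary(cross_play):
    """Summarize training vs. cross-play performance from a square reward grid.

    Assumption: diagonal entries are pairs that trained together (TP);
    off-diagonal entries are pairs that never trained together (CP).
    """
    m = np.asarray(cross_play, dtype=float)
    if m.ndim != 2 or m.shape[0] != m.shape[1]:
        raise ValueError("cross_play must be a square 2-D array")
    diag = np.eye(m.shape[0], dtype=bool)
    tp, cp = m[diag], m[~diag]
    return {
        "tp_mean": float(tp.mean()),
        "tp_std": float(tp.std()),
        "cp_mean": float(cp.mean()),
        "cp_std": float(cp.std()),
        "gap": float(tp.mean() - cp.mean()),  # TP minus CP
    }

# Made-up rewards for three independently trained pairs.
example_grid = [
    [10.0, 2.0, 3.0],
    [1.0, 9.0, 2.0],
    [4.0, 3.0, 11.0],
]
print(tp_cp_summary(example_grid))
```

A large positive `gap` together with a high `cp_std` would correspond to the brittle behavior the captions describe for IPPO, while a gap near zero would correspond to the robust case.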

Theorems & Definitions (34)

Definition 1: mazumdar2024tractableequilibriumcomputationmarkov, Definition 5
Theorem 1: risk induces collaboration
Example 2
Remark 3
Definition 2: free-riding
Theorem 4: risk removes free-riding
Example 5
Remark 6
Lemma 7: risk-averse quantal best response
Proof
...and 24 more
