Robust and Diverse Multi-Agent Learning via Rational Policy Gradient

Niklas Lauffer, Ameesh Shah, Micah Carroll, Sanjit A. Seshia, Stuart Russell, Michael Dennis

TL;DR

The paper addresses the challenge of learning robust policies in general-sum multi-agent settings where naive adversarial training induces self-sabotage. It introduces Rationality-Preserving Policy Optimization (RPO) to enforce rationality by requiring adversaries to be best responses to some co-policy, and Rational Policy Gradient (RPG) to optimize adversarial objectives under this constraint. RPG uses manipulators to shape the learning of base agents while propagating higher-order gradients, enabling several algorithms (AP-RPG, AT-RPG, PAIRED-RPG, PAIRED-A-RPG, AD-RPG) that find rational adversarial examples, improve robustness, and discover genuinely diverse strategies. Empirically, RPG-based methods outperform baselines in cooperative and mixed-motive environments (e.g., STORM, Overcooked, Hanabi) by avoiding self-sabotage, achieving higher cross-play rewards, and producing non-trivial adversarial weaknesses. Overall, the work extends adversarial optimization from zero-sum to general-sum MARL, offering a unified framework for robustness and diversity via rational adversarial learning.

Abstract

Adversarial optimization algorithms that explicitly search for flaws in agents' policies have been successfully applied to finding robust and diverse policies in multi-agent settings. However, the success of adversarial optimization has been largely limited to zero-sum settings because its naive application in cooperative settings leads to a critical failure mode: agents are irrationally incentivized to self-sabotage, blocking the completion of tasks and halting further learning. To address this, we introduce Rationality-preserving Policy Optimization (RPO), a formalism for adversarial optimization that avoids self-sabotage by ensuring agents remain rational--that is, their policies are optimal with respect to some possible partner policy. To solve RPO, we develop Rational Policy Gradient (RPG), which trains agents to maximize their own reward in a modified version of the original game in which we use opponent shaping techniques to optimize the adversarial objective. RPG enables us to extend a variety of existing adversarial optimization algorithms that, no longer subject to the limitations of self-sabotage, can find adversarial examples, improve robustness and adaptability, and learn diverse policies. We empirically validate that our approach achieves strong performance in several popular cooperative and general-sum environments. Our project page can be found at https://rational-policy-gradient.github.io.
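The difference between naive and rationality-preserving adversaries can be illustrated on a toy common-payoff matrix game. The sketch below is not the paper's RPG algorithm (which is gradient-based and uses opponent shaping). It is a brute-force illustration under stated assumptions: the payoff matrix, the restriction to pure adversary actions, the grid over partner policies, and all function names are hypothetical choices made for this example. An action counts as "rational" here if it is a best response to at least one partner policy on the grid.

```python
from fractions import Fraction

# Hypothetical common-payoff game: PAYOFF[a][b] is the reward shared by both
# players when the adversary plays row a and the partner plays column b.
# Rows 0 and 1 are two coordination conventions; row 2 is pure sabotage.
PAYOFF = [
    [1, 0],
    [0, 1],
    [-1, -1],
]


def expected_payoff(action, partner_policy):
    """Shared reward of a pure row action against a mixed column policy."""
    return sum(p * r for p, r in zip(partner_policy, PAYOFF[action]))


def best_responses(partner_policy):
    """All row actions that maximize the shared reward against the partner."""
    values = [expected_payoff(a, partner_policy) for a in range(len(PAYOFF))]
    best = max(values)
    return {a for a, v in enumerate(values) if v == best}


def rational_actions(grid=10):
    """Row actions that are a best response to SOME partner policy.

    Assumption: partner policies are enumerated on a coarse grid, which is
    enough for this tiny game but is not a general-purpose procedure.
    """
    rational = set()
    for k in range(grid + 1):
        p = Fraction(k, grid)  # exact arithmetic avoids float tie issues
        rational |= best_responses([p, 1 - p])
    return rational


def adversarial_action(victim_policy, allowed):
    """Row action from `allowed` that minimizes the victim's shared reward."""
    return min(sorted(allowed), key=lambda a: expected_payoff(a, victim_policy))


victim = [1, 0]  # the victim always plays column 0
naive = adversarial_action(victim, range(len(PAYOFF)))
rational = adversarial_action(victim, rational_actions())

print("naive adversary picks row", naive)
print("rational actions:", sorted(rational_actions()))
print("rational adversary picks row", rational)
```

In this toy game the unconstrained adversary selects the sabotage row, which tells us nothing useful about the victim. The constrained adversary can only select a row that some partner would actually want it to play, so it selects the other coordination convention and exposes a genuine weakness: the victim fails to coordinate with a sensible partner that follows a different convention.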

Paper Structure

This paper contains 52 sections, 11 equations, 21 figures, 2 tables, 1 algorithm.

Figures (21)

  • Figure 1: Rational policy gradient (RPG) allows finding rational adversarial examples, robustifying behavior, and discovering diverse policies.
  • Figure 2: A cooperative game.
  • Figure 3: RPG update with lookahead $N$
  • Figure 4: Left: training curves for each learning agent in AD-RPG. Self-play rewards increase while cross-play scores decrease (but notably never reach zero). Manipulators only reach the level of reward necessary to influence the base agents to be diverse. Shaded region shows 95% confidence interval. Right: self-play and cross-play rewards for adversarial diversity-based algorithms in the cramped room Overcooked layout. Self-sabotage leads to deceptively low cross-play rewards for CoMeDi and AD, while AD-RPG maintains a high score even in cross-play.
  • Figure 5: Cross-play grids between different algorithms across environments. Each square represents the average reward from a specific pair of seeds trained by one of the three algorithms when paired as teammates. Standard error < 1 for Overcooked and < 0.1 for Hanabi for all values.
  • ...and 16 more figures

Theorems & Definitions (2)

  • Definition 3.1: Self-sabotage
  • Definition 3.2: Rationality-preserving Policy Optimization (RPO)