Advantage Alignment Algorithms

Juan Agustin Duque, Milad Aghajohari, Tim Cooijmans, Razvan Ciuca, Tianyu Zhang, Gauthier Gidel, Aaron Courville

TL;DR

This work tackles self-interested cooperation in general-sum multi-agent RL by deriving Advantage Alignment, a simple, first-principles objective that aligns agents' advantages to steer trajectories toward mutually beneficial actions. Advantage Alignment unifies and simplifies prior opponent-shaping methods (LOLA, LOQA), and extends to continuous actions via Proximal Advantage Alignment with a PPO-style surrogate, while preserving Nash equilibria. Empirical results across Iterated Prisoner's Dilemma, Coin Game, Negotiation Game, and Melting Pot Commons Harvest Open demonstrate state-of-the-art cooperation and robustness to exploitation, including scalable performance in high-dimensional, partially observable settings.

Abstract

Artificially intelligent agents are increasingly being integrated into human decision-making: from large language model (LLM) assistants to autonomous vehicles. These systems often optimize their individual objective, leading to conflicts, particularly in general-sum games where naive reinforcement learning agents empirically converge to Pareto-suboptimal Nash equilibria. To address this issue, opponent shaping has emerged as a paradigm for finding socially beneficial equilibria in general-sum games. In this work, we introduce Advantage Alignment, a family of algorithms derived from first principles that perform opponent shaping efficiently and intuitively. We achieve this by aligning the advantages of interacting agents, increasing the probability of mutually beneficial actions when their interaction has been positive. We prove that existing opponent shaping methods implicitly perform Advantage Alignment. Compared to these methods, Advantage Alignment simplifies the mathematical formulation of opponent shaping, reduces the computational burden and extends to continuous action domains. We demonstrate the effectiveness of our algorithms across a range of social dilemmas, achieving state-of-the-art cooperation and robustness against exploitation.
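
The alignment idea in the abstract can be sketched numerically. The snippet below is a minimal illustration, not the authors' implementation: for each time step it multiplies the gamma-discounted sum of the agent's own past advantages by the opponent's current advantage. A positive product suggests raising the probability of the action taken at that step, a negative one lowering it. The function name `alignment_weights`, the discount convention `gamma ** (t - k)`, and the sample advantage values are assumptions made for illustration.

```python
def alignment_weights(own_adv, opp_adv, gamma=0.9):
    """Per-step product of own discounted past advantages and the
    opponent's current advantage (illustrative convention only)."""
    if len(own_adv) != len(opp_adv):
        raise ValueError("advantage sequences must have equal length")
    weights = []
    past = 0.0  # discounted sum of own advantages over steps k < t
    for a_own, a_opp in zip(own_adv, opp_adv):
        # Only strictly earlier steps contribute to the weight at step t.
        weights.append(past * a_opp)
        # Fold in the current step and discount once, so that step k
        # carries weight gamma**(t - k) when used at a later step t.
        past = gamma * (past + a_own)
    return weights


# Hypothetical advantages for a three-step trajectory.
own = [1.0, -1.0, 0.5]
opp = [0.5, 1.0, -2.0]
print(alignment_weights(own, opp, gamma=0.5))  # -> [0.0, 0.5, 0.5]
```

In this toy trajectory the weight at the first step is zero because there is no past interaction yet. The later two weights are positive because the agent's accumulated advantage and the opponent's current advantage share the same sign.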

Paper Structure

This paper contains 37 sections, 5 theorems, 56 equations, 8 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1

Given a two-player game where players 1 and 2 have respective policies $\pi^1(a|s)$ and $\pi^2(b|s)$, where each policy is parametrised such that the set of gradients $\nabla_{\theta_2} \log \pi^2(a|s)$ for all pairs $(a,s)$ forms an orthonormal basis, the LOLA update for the first player corresponds to [...] where $A^i_k := A^{i}(s_k, a_k, b_k)$ and $d_{\gamma,k}$ is the occupancy measure of the tuple $(a

Figures (8)

  • Figure 1: (a) The sign of the product of the agent's gamma-discounted past advantages and the opponent's current advantage indicates whether the probability of taking an action should increase or decrease. (b) The empirical probability of cooperation of Advantage Alignment for each previous combination of actions in the one-step-history Iterated Prisoner's Dilemma closely resembles tit-for-tat. Results are averaged over 10 random seeds; the black whiskers show one standard deviation.
  • Figure 2: League Results of the Advantage Alignment agents in Coin Game: LOQA, POLA, MFOS, Always Cooperate (AC), Always Defect (AD), Random and Advantage Alignment (AdAlign). Each number in the plot is computed by running 10 random seeds of each agent head to head with 10 seeds of another for 50 episodes of length 16 and averaging the rewards.
  • Figure 3: (a) League Results of the Advantage Alignment agents in the Negotiation Game: Always Cooperate (AC), an agent which proposes $5$ for items which are more valuable to it and $1$ for items that are less valuable to it, Always Defect (AD), an agent that proposes $5$ regardless of the values, Advantage Alignment (AdAlign), PPO and PPO summing rewards (PPO-SR). Each number in the plot is computed by running 10 random seeds of each agent head to head with 10 seeds of another for 50 episodes of length 16 and averaging the rewards. Note that against Always Defect, Always Cooperate gets an average return of $0.25$ while Always Defect gets $0.30$. (b) Sample trajectories of AdAlign vs. AdAlign and PPO vs. PPO in the Negotiation Game. The numbers show the utilities and proposals, which have been rounded to integer values. AdAlign agents defect first (red) and progressively cooperate with each other (blue), while PPO agents always defect.
  • Figure 4: Comparison of different reinforcement learning algorithms in Melting Pot 2.0's Commons Harvest Open. The score is the focal return per capita, min-max normalized between a random agent and an exploiter baseline (ACB agent with an LSTM policy/value network) trained for $10^9$ steps. Following the protocol of the Melting Pot contest, we select the best agent out of 10 seeds and evaluate it 100 times.
  • Figure 5: Frames of evaluation trajectories for different algorithms. Qualitatively, we demonstrate that Proximal Advantage Alignment (adalign) also outperforms naive PPO (ppo) and PPO with summed rewards (ppo_p). The evaluation trajectories show that adalign agents keep more apple bushes (2) from extinction for longer than either ppo or ppo_p. Note that in the Commons Harvest evaluation, two exploiter agents, green and yellow, play against a focal population of 5 copies of the evaluated algorithm.
  • ...and 3 more figures
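
The score normalization described in the Figure 4 caption can be sketched as follows. This is a minimal illustration assuming the usual min-max form, with the random agent's return mapped to 0 and the exploiter baseline's return mapped to 1; the caption does not spell out the formula. The function name `normalized_score` and the sample returns are hypothetical, not values from the paper.

```python
def normalized_score(focal_return, random_return, exploiter_return):
    """Min-max normalize a focal per-capita return between two baselines.

    Assumed convention: random agent -> 0.0, exploiter baseline -> 1.0.
    """
    span = exploiter_return - random_return
    if span == 0:
        raise ValueError("baselines must differ to define a scale")
    return (focal_return - random_return) / span


# Hypothetical returns: random agent 10.0, exploiter baseline 50.0.
print(normalized_score(30.0, 10.0, 50.0))  # -> 0.5
```

Under this convention a score above 1 would mean the evaluated agent exceeds the exploiter baseline, and a score below 0 would mean it does worse than random.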

Theorems & Definitions (10)

  • Theorem 1: LOLA as an advantage alignment estimator
  • Theorem 2: LOQA as an advantage alignment estimator
  • Theorem 3: Advantage Alignment preserves Nash equilibria
  • Lemma 1: Policy changes under gradient ascent
  • proof
  • proof
  • proof
  • Lemma 2: Zero Advantages At Nash
  • proof
  • proof