Learning Robust Social Strategies with Large Language Models

Dereck Piche, Mohammed Muqeeth, Milad Aghajohari, Juan Duque, Michael Noukhovitch, Aaron Courville

TL;DR

This work investigates how LLMs behave in multi-agent social dilemmas under reinforcement learning. Naive multi-agent RL (MARL) drives LLMs toward greedy, exploitative strategies across diverse environments, and the resulting agents can exploit even advanced closed-source models. The authors adapt Advantage Alignment, with a group-relative baseline and a LoRA-based agent buffer, to train LLMs toward cooperative and non-exploitable behavior, demonstrated on IPD, Split No-Comm, and the communication-enabled Trust-and-Split. Results show improved collective welfare and robustness to exploitation, with learned strategies that resemble tit-for-tat and grim trigger, and the approach remains effective against adversarial RL opponents. The work also introduces a scalable social-dilemma testbed and releases code to support future multi-agent RL research for LLMs.

Abstract

As agentic AI becomes more widespread, agents with distinct and possibly conflicting goals will interact in complex ways. These multi-agent interactions pose a fundamental challenge, particularly in social dilemmas, where agents' individual incentives can undermine collective welfare. While reinforcement learning (RL) has been effective for aligning large language models (LLMs) in the single-agent regime, prior small-network results suggest that standard RL in multi-agent settings often converges to defecting, self-interested policies. We show the same effect in LLMs: despite cooperative priors, RL-trained LLM agents develop opportunistic behavior that can exploit even advanced closed-source models. To address this tendency of RL to converge to poor equilibria, we adapt a recent opponent-learning awareness algorithm, Advantage Alignment, to fine-tune LLMs toward multi-agent cooperation and non-exploitability. We then introduce a group-relative baseline that simplifies advantage computation in iterated games, enabling multi-agent training at LLM scale. We also contribute a novel social dilemma environment, Trust-and-Split, which requires natural language communication to achieve high collective welfare. Across a wide range of social dilemmas, policies learned with Advantage Alignment achieve higher collective payoffs while remaining robust against exploitation by greedy agents. We release all of our code to support future work on multi-agent RL training for LLMs.
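The abstract does not spell out the group-relative baseline. As a rough illustration only, the sketch below assumes it resembles the GRPO-style practice of subtracting a group mean: several rollouts of the same iterated game form a group, and each rollout's return-to-go at round t is compared against the group's mean return-to-go at that round. The function names, the `gamma` parameter, and the per-round averaging are assumptions made for this example, not the authors' formulation.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted sum of rewards from each round to the end of one rollout."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def group_relative_advantages(
    group_rewards: List[List[float]], gamma: float = 1.0
) -> List[List[float]]:
    """Hypothetical baseline: subtract the group's mean return-to-go per round.

    group_rewards[i][t] is the reward of rollout i at round t. All rollouts
    are assumed to have the same number of rounds.
    """
    returns = [returns_to_go(rewards, gamma) for rewards in group_rewards]
    horizon = len(returns[0])
    if any(len(r) != horizon for r in returns):
        raise ValueError("all rollouts in a group must have the same length")
    # One baseline value per round, shared by every rollout in the group.
    baselines = [sum(r[t] for r in returns) / len(returns) for t in range(horizon)]
    return [[r[t] - baselines[t] for t in range(horizon)] for r in returns]


# Two rollouts of a two-round game; the second earns more in round one.
example = group_relative_advantages([[1.0, 1.0], [3.0, 1.0]])
```

A baseline of this kind needs no learned value network, which is one plausible reading of why a group-relative baseline would simplify advantage computation at LLM scale.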

Paper Structure

This paper contains 23 sections, 2 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: One round of Trust-and-Split. Each player receives a private rock-paper-scissors hand that determines how much they value the coins, sends one message in turn, and then submits a proposal. Payoffs follow the split rule. Both hands and proposals are revealed before the next round starts.
  • Figure 2: Training curves of multi-agent GRPO on several open-source LLMs across IPD, Split No-Comm, and Trust-and-Split. In all environments, average rewards converge to the greedy payoff levels, showing that naive MARL drives LLMs toward defecting strategies in social dilemmas.
  • Figure 3: Average rewards when evaluating an Advantage Alignment (AdAlign) agent, an always-cooperate (Coop) agent, and an always-defect (Defect) agent. In IPD (left) and Split No-Comm (right), AdAlign achieves near-cooperative payoffs with itself and with Coop while remaining robust against Defect. Results are averaged over 8 seeds.
  • Figure 4: Average reward in Trust-and-Split when pitting an Advantage Alignment (AdAlign) agent against cooperators trained with multi-agent GRPO on the sum of rewards (GRPO-SR) and defectors trained with multi-agent GRPO (GRPO). AdAlign cooperates with cooperative partners and with itself, yet avoids being exploited by greedy agents. Results are averaged over 8 seeds.
  • Figure 5: Example Trust-and-Split interaction showing the tit-for-tat behavior learned by Advantage Alignment. After Bob defects (as seen in the prompt summary of Alice for round 7), Alice defects in round 7, then returns to cooperation in round 9 once Bob cooperates again (shown in the prompt summary of Alice for round 9).
  • ...and 9 more figures
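
The robustness pattern described in the captions for Figures 3 and 5 (cooperate with cooperators, stop rewarding a defector) can be illustrated with scripted strategies in the iterated prisoner's dilemma. The sketch below is a toy, not the learned LLM policies: the payoff values are the classic textbook ones (3 for mutual cooperation, 1 for mutual defection, 5 and 0 for unilateral defection) and are an assumption, since the paper's payoffs are not listed here, and all function names are hypothetical.

```python
from typing import Callable, List, Tuple

# Assumed textbook payoffs: (row player's reward, column player's reward).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

# A strategy maps the opponent's past moves to the next move, "C" or "D".
Strategy = Callable[[List[str]], str]


def always_cooperate(opponent_moves: List[str]) -> str:
    return "C"


def always_defect(opponent_moves: List[str]) -> str:
    return "D"


def tit_for_tat(opponent_moves: List[str]) -> str:
    # Cooperate first, then copy the opponent's previous move.
    return opponent_moves[-1] if opponent_moves else "C"


def average_rewards(a: Strategy, b: Strategy, rounds: int = 10) -> Tuple[float, float]:
    """Play `rounds` simultaneous-move rounds and return per-round averages."""
    moves_a: List[str] = []
    moves_b: List[str] = []
    total_a = total_b = 0
    for _ in range(rounds):
        move_a, move_b = a(moves_b), b(moves_a)
        reward_a, reward_b = PAYOFFS[(move_a, move_b)]
        total_a += reward_a
        total_b += reward_b
        moves_a.append(move_a)
        moves_b.append(move_b)
    return total_a / rounds, total_b / rounds


print(average_rewards(tit_for_tat, tit_for_tat))         # (3.0, 3.0)
print(average_rewards(tit_for_tat, always_defect))       # (0.9, 1.4)
print(average_rewards(always_cooperate, always_defect))  # (0.0, 5.0)
```

Under these assumed payoffs, tit-for-tat loses only the first round to an always-defect opponent and then holds it to the mutual-defection payoff, whereas an always-cooperate agent is exploited in every round.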