Evolution of Societies via Reinforcement Learning

Yann Bouteiller, Karthik Soma, Giovanni Beltrame

TL;DR

This work addresses the scalability challenge of studying social evolution when agents learn via Multi-Agent Reinforcement Learning (MARL). It derives fast, exact Policy Gradient (PG) and LOLA updates for symmetric normal-form games and batches these updates across large populations. The authors demonstrate 200,000-agent evolutionary simulations in Stag Hunt (SH), Hawk-Dove (HD), and Rock-Paper-Scissors (RPS), revealing that LOLA can promote cooperation in SH, delay cooperative outcomes in HD, and reduce diversity in RPS, while the mean population policy tends toward the Nash equilibrium due to uniform random matching. The methodology provides a practical framework for analyzing how advanced MARL revision protocols shape societal dynamics at scale, and offers insights into when non-stationarity-aware learning may favor cooperation or diversity. The work lays groundwork for future extensions to episodic MARL and structured partner interactions in large populations, with open-source implementations to facilitate reproducibility and further research.

Abstract

The universe involves many independent co-learning agents as an ever-evolving part of our observed environment. Yet, in practice, Multi-Agent Reinforcement Learning (MARL) applications are typically constrained to small, homogeneous populations and remain computationally intensive. We propose a methodology that enables simulating populations of Reinforcement Learning agents at evolutionary scale. More specifically, we derive a fast, parallelizable implementation of Policy Gradient (PG) and Opponent-Learning Awareness (LOLA), tailored for evolutionary simulations where agents undergo random pairwise interactions in stateless normal-form games. We demonstrate our approach by simulating the evolution of very large populations made of heterogeneous co-learning agents, under both naive and advanced learning strategies. In our experiments, 200,000 PG or LOLA agents evolve in the classic games of Hawk-Dove, Stag-Hunt, and Rock-Paper-Scissors. Each game provides distinct insights into how populations evolve under both naive and advanced MARL rules, including compelling ways in which Opponent-Learning Awareness affects social evolution.
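To make the setting in the abstract concrete, here is a minimal sketch of one "evolution step": agents are paired uniformly at random, and each takes one exact policy-gradient ascent step against its partner in a stateless, symmetric two-action game. This is an illustration under stated assumptions, not the authors' implementation. The payoff values, the single-logit sigmoid policy, the learning rate, and all function names (`exact_pg_grad`, `evolution_step`, `simulate`) are hypothetical choices made for this example. It covers only the naive PG case and loops over pairs in plain Python rather than batching them.

```python
import math
import random

# Assumed Stag Hunt payoffs for the row player (illustrative values only,
# not taken from the paper). Action 0 = Stag, action 1 = Hare.
PAYOFF = [[4.0, 0.0],
          [3.0, 3.0]]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def exact_pg_grad(theta_self, theta_other, payoff=PAYOFF):
    """Exact gradient of the expected payoff w.r.t. the agent's own logit.

    Each agent plays action 0 with probability sigmoid(theta).
    """
    p = sigmoid(theta_self)
    q = sigmoid(theta_other)
    # Expected payoff of each pure action against the partner's mixed policy.
    u0 = payoff[0][0] * q + payoff[0][1] * (1.0 - q)
    u1 = payoff[1][0] * q + payoff[1][1] * (1.0 - q)
    # V = p * u0 + (1 - p) * u1 and dp/dtheta = p * (1 - p).
    return p * (1.0 - p) * (u0 - u1)


def evolution_step(thetas, lr, rng):
    """Pair agents uniformly at random; each paired agent takes one PG step."""
    order = list(range(len(thetas)))
    rng.shuffle(order)
    new_thetas = list(thetas)
    # Consecutive entries of the shuffled order form the pairs; with an odd
    # population size, the last agent is left unpaired for this step.
    for k in range(0, len(order) - 1, 2):
        i, j = order[k], order[k + 1]
        new_thetas[i] = thetas[i] + lr * exact_pg_grad(thetas[i], thetas[j])
        new_thetas[j] = thetas[j] + lr * exact_pg_grad(thetas[j], thetas[i])
    return new_thetas


def simulate(n_agents=1000, n_steps=200, lr=0.5, seed=0):
    """Run several evolution steps and return each agent's P(action 0)."""
    rng = random.Random(seed)
    thetas = [rng.uniform(-3.0, 3.0) for _ in range(n_agents)]
    for _ in range(n_steps):
        thetas = evolution_step(thetas, lr, rng)
    return [sigmoid(t) for t in thetas]
```

Calling `simulate()` returns one probability per agent, which can be histogrammed to inspect how the population is distributed between the two pure strategies. Because the game is stateless, the expected payoff has a closed form and its gradient can be computed exactly rather than estimated from sampled play. A vectorized version would compute the same per-pair expression for all pairs at once.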

Paper Structure

This paper contains 23 sections, 32 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Populations of 200,000 RL agents evolving in the classic games of Stag Hunt, Hawk-Dove and Rock-Paper-Scissors (columns), via Policy Gradient and LOLA (rows). Each agent is a stochastic policy, represented as linear coordinates between pure strategies. Dark shades of blue indicate high concentrations of agents, and evolution steps correspond to one learning step performed per agent. In Hawk-Dove, black dots indicate the average policy over the entire population.
  • Figure 2: Final average policy over the population, depending on cost values.
  • Figure 3: Pairing and batching
  • Figure 4: Duration of a full evolution step (lower is better)
  • Figure 5: Self-play. In SH and HD, the color marks the initial policy.
  • ...and 5 more figures

Theorems & Definitions (1)

  • proof