Open-ended Learning in Symmetric Zero-sum Games

David Balduzzi, Marta Garnelo, Yoram Bachrach, Wojciech M. Czarnecki, Julien Perolat, Max Jaderberg, Thore Graepel

TL;DR

The paper develops a geometric framework for open-ended learning in symmetric zero-sum games. It models agents as parametrized strategies in functional-form games (FFGs) and introduces the concept of gamescapes. It decomposes FFGs into transitive and cyclic components via a Hodge-like decomposition, and defines two population-level metrics, population performance and effective diversity, to guide learning beyond single-agent improvement. It proposes two algorithms, PSRO_N and PSRO_rN; PSRO_rN uses niching to grow a diverse set of effective strategies and outperforms the baselines in two highly nontransitive games, Blotto and Differentiable Lotto. The work unifies gradient-based learning with game-theoretic objectives, formalizes the notion of an evolving strategy landscape, and provides tools to analyze and increase the exploration of strategic dimensions through adaptive objectives. Overall, PSRO_rN yields stronger, more diverse populations and opens avenues for robust open-ended learning in complex, nontransitive environments.

Abstract

Zero-sum games such as chess and poker are, abstractly, functions that evaluate pairs of agents, for example labeling them 'winner' and 'loser'. If the game is approximately transitive, then self-play generates sequences of agents of increasing strength. However, nontransitive games, such as rock-paper-scissors, can exhibit strategic cycles, and there is no longer a clear objective: we want agents to increase in strength, but against whom is unclear. In this paper, we introduce a geometric framework for formulating agent objectives in zero-sum games, in order to construct adaptive sequences of objectives that yield open-ended learning. The framework allows us to reason about population performance in nontransitive games, and enables the development of a new algorithm (rectified Nash response, PSRO_rN) that uses game-theoretic niching to construct diverse populations of effective agents, producing a stronger set of agents than existing algorithms. We apply PSRO_rN to two highly nontransitive resource allocation games and find that PSRO_rN consistently outperforms the existing alternatives.
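
To make the contrast between transitive and nontransitive games concrete, the following minimal sketch encodes two three-agent games as payoff matrices and follows a chain of best responses in each. The matrices, the `skill` vector, and the helper names are illustrative assumptions, not code or notation from the paper.

```python
import numpy as np

# Row player's payoff: +1 win, -1 loss, 0 tie. Order: rock, paper, scissors.
RPS = np.array([[0, -1, 1],
                [1, 0, -1],
                [-1, 1, 0]])

# Hypothetical transitive game: the agent with the higher skill always wins.
skill = np.array([0, 1, 2])
TRANSITIVE = np.sign(skill[:, None] - skill[None, :])


def is_symmetric_zero_sum(payoff):
    """A symmetric zero-sum game has an antisymmetric payoff matrix."""
    return np.array_equal(payoff, -payoff.T)


def best_response_chain(payoff, start, steps):
    """Repeatedly switch to the agent that scores highest against the current one."""
    chain = [start]
    for _ in range(steps):
        current = chain[-1]
        # Column `current` holds every agent's payoff against the current agent.
        chain.append(int(np.argmax(payoff[:, current])))
    return chain


print(best_response_chain(TRANSITIVE, start=0, steps=3))  # settles on the strongest agent
print(best_response_chain(RPS, start=0, steps=3))         # returns to where it started
```

In the transitive game the chain climbs to the strongest agent and stays there, which is the regime where self-play works. In rock-paper-scissors the chain cycles, so "beat the previous agent" never produces a strongest agent.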

Paper Structure

This paper contains 43 sections, 19 theorems, 68 equations, 7 figures, 4 algorithms.

Key Result

Theorem 1

Every ${\mathsf{FFG}}$ decomposes into a sum of a transitive and cyclic game with respect to a suitably defined inner product.
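
For a finite population, one simple analogue of this decomposition can be sketched as follows. It assumes uniform weighting over the agents (the theorem's "suitably defined inner product" is not specified in this summary), rates each agent by its average payoff, and takes the transitive part to be the difference of ratings. The function name `decompose` and the example matrices are hypothetical.

```python
import numpy as np


def decompose(payoff):
    """Split an antisymmetric payoff matrix into transitive and cyclic parts.

    Assumption: agents are weighted uniformly, so an agent's rating is its
    mean payoff against the whole population.
    """
    payoff = np.asarray(payoff, dtype=float)
    if not np.allclose(payoff, -payoff.T):
        raise ValueError("payoff matrix must be antisymmetric")
    ratings = payoff.mean(axis=1)
    # Transitive part: payoff is fully explained by the difference in ratings.
    transitive = ratings[:, None] - ratings[None, :]
    # Cyclic part: the remainder; every agent's mean payoff in it is zero.
    cyclic = payoff - transitive
    return ratings, transitive, cyclic


# Example: a skill-difference game plus a rock-paper-scissors cycle.
rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
skill = np.array([0.0, 1.0, 3.0])
mixed = (skill[:, None] - skill[None, :]) + rps

ratings, transitive, cyclic = decompose(mixed)
print(ratings)     # skill shifted to have zero mean
print(transitive)  # recovers the skill-difference game
print(cyclic)      # recovers the rock-paper-scissors cycle
```

The two parts sum back to the original matrix by construction, and the cyclic part has zero row sums, so no agent is better than another on average within it.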

Figures (7)

  • Figure 1: Low-dim gamescapes of various basic game structures. Top row: Evaluation matrices of populations of 40 agents each; colors vary from red to green as $\phi$ ranges over $[-1,1]$. Bottom row: 2-dim embedding obtained by using the first 2 dimensions of the Schur decomposition of the payoff matrix; color corresponds to the average payoff of an agent against the entire population; the ${\mathsf{EGS}}$ of the transitive game is a line; the ${\mathsf{EGS}}$ of the cyclic game is a two-dim near-circular polytope given by the convex hull of points. For an extended version, see Figure \ref{f:embeddings} in the Appendix.
  • Figure 2: The disc game. A: A set of possible agents from the disc game is shown as blue dots. Three agents with non-transitive rock-paper-scissors relations are visualized in red. B: Three concentric gamescapes spanned by populations with rock-paper-scissors interactions of increasing strength.
  • Figure 3: A: Rock-paper-scissors. B: Gradient updates obtained from ${\mathsf{PSRO_{rN}}}$, which amplify strengths, grow the gamescape (gray to blue). C: Gradients obtained by optimizing agents to reduce their losses shrink the gamescape (gray to red).
  • Figure 4: Performance of ${\mathsf{PSRO_{rN}}}$ relative to self-play, ${\mathsf{PSRO_U}}$ and ${\mathsf{PSRO_N}}$ on Blotto (left) and Differentiable Lotto (right). In all cases, the relative performance of ${\mathsf{PSRO_{rN}}}$ is positive, so it outperforms the other algorithms.
  • Figure 5: Visualizations of training progress in the Differentiable Lotto experiment. Left: Comparison of trajectories taken by each algorithm in the 2-dim Schur embedding of the ${\mathsf{EGS}}$; a black dot represents the first agent found by the algorithm and a dashed line represents the convex hull. The shaded blue region shows the area of the convex hull of ${\mathsf{PSRO_{rN}}}$. Notice how ${\mathsf{PSRO_{rN}}}$ consistently expands the convex hull through ladder-like movements. See Figure \ref{f:schurs} for an extended version. Right: Area of the convex hull spanned by populations over time. Note that only ${\mathsf{PSRO_{rN}}}$ consistently increases the convex hull in all iterations.
  • ...and 2 more figures
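
The disc-game picture in Figure 2 can be reproduced numerically with a small sketch. It assumes the disc-game payoff is the 2D cross product $\phi(v, w) = v_1 w_2 - v_2 w_1$ (the definition is not given in this summary, and the sign convention may differ from the paper's), and it places three agents at equally spaced angles on a circle. The function names and the chosen angles are hypothetical.

```python
import numpy as np


def rps_triple(radius):
    """Three agents equally spaced on a circle of the given radius."""
    angles = np.deg2rad([90.0, 210.0, 330.0])
    return radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)


def payoff_matrix(agents):
    """Pairwise payoffs under the assumed rule phi(v, w) = v1*w2 - v2*w1."""
    x, y = agents[:, 0], agents[:, 1]
    return np.outer(x, y) - np.outer(y, x)


# Concentric triples: same cyclic pattern, larger payoffs at larger radius.
for radius in (0.5, 1.0):
    print(radius)
    print(np.round(payoff_matrix(rps_triple(radius)), 3))
```

Under this assumption each triple has the rock-paper-scissors sign pattern (agent 0 beats 1, 1 beats 2, 2 beats 0), and the payoff magnitude scales with the squared radius, which is one way to read the "interactions of increasing strength" in panel B.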

Theorems & Definitions (40)

  • Definition 1
  • Theorem 1: game decomposition
  • Example 1: Disc game
  • Example 2: Rock-paper-scissors embeds in disc game
  • Definition 2
  • Proposition 2
  • Proposition 3
  • Example 3: latent dimension of long cycles
  • Proposition 4
  • Definition 3
  • ...and 30 more