Neural Population Learning beyond Symmetric Zero-sum Games

Siqi Liu; Luke Marris; Marc Lanctot; Georgios Piliouras; Joel Z. Leibo; Nicolas Heess

TL;DR

This work introduces NeuPL-JPSRO, a scalable neural population learning algorithm that converges to a normal-form coarse correlated equilibrium (CCE) in n-player general-sum games by combining strategy embeddings, distillation, and regularisation with best-response learning. It extends JPSRO by sharing representations across policies, which enables skill transfer and online adaptation, and it uses a payoff-estimator network to evaluate metagame payoffs efficiently. Empirical results in OpenSpiel, MuJoCo cheetah-run, and multi-agent capture-the-flag show convergence toward a CCE and the emergence of coordinated, transferable skills, even under partial observability. The approach offers a practical route to solving real-world interactions between heterogeneous agents with mixed motives while balancing tractability, scalability, and convergence guarantees.

Abstract

We study computationally efficient methods for finding equilibria in n-player general-sum games, specifically ones that afford complex visuomotor skills. We show how existing methods would struggle in this setting, either computationally or in theory. We then introduce NeuPL-JPSRO, a neural population learning algorithm that benefits from transfer learning of skills and converges to a Coarse Correlated Equilibrium (CCE) of the game. We show empirical convergence in a suite of OpenSpiel games, validated rigorously by exact game solvers. We then deploy NeuPL-JPSRO to complex domains, where our approach enables adaptive coordination in a MuJoCo control domain and skill transfer in capture-the-flag. Our work shows that equilibrium convergent population learning can be implemented at scale and in generality, paving the way towards solving real-world games between heterogeneous players with mixed motives.

Paper Structure (29 sections, 2 theorems, 4 equations, 10 figures, 2 tables, 2 algorithms)

Introduction
Preliminaries
Coarse Correlated Equilibrium (CCE)
Joint Policy-Space Response Oracle (JPSRO)
NeuPL-JPSRO
Convergence to Equilibria
Scaling to large games
BR learning to CCE co-player mixed-strategies
BR learning to any co-player mixed-strategy
Expected payoff evaluation
Results
Convergence in n-player general-sum games
Online adaptation in multiagent MuJoCo domains
Strategic team-play in capture-the-flag
Related Work
...and 14 more sections

Key Result

Theorem 3.2

When NeuPL-JPSRO uses a CCE meta-strategy solver and the distill and regularise operators are exact, the sequence of mixed strategies converges to a normal-form CCE under the meta-strategy distribution.
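
To make the notion of a normal-form CCE concrete, the following is a minimal sketch (not the authors' implementation) of a CCE gap computation for a small normal-form game. A joint distribution $\sigma$ over joint actions is a CCE when no player $p$ gains by ignoring the recommendation and committing to a fixed action $a'_p$, that is, $\sum_a \sigma(a)\,[G_p(a'_p, a_{\neg p}) - G_p(a)] \le 0$ for every $p$ and $a'_p$. The function name `cce_gap`, the payoff tensor layout, and the choice to report one clipped gap per player are assumptions made for illustration.

```python
import numpy as np


def cce_gap(payoffs, joint):
    """Per-player CCE gaps of a joint distribution in a normal-form game.

    payoffs: array of shape (n_players, A_1, ..., A_n); payoffs[p][a] is
             player p's payoff under joint action a (hypothetical layout).
    joint:   array of shape (A_1, ..., A_n) that sums to 1.
    Returns a list with, for each player, the largest expected gain from
    deviating to a fixed action, clipped below at zero.
    """
    payoffs = np.asarray(payoffs, dtype=float)
    joint = np.asarray(joint, dtype=float)
    n_players = payoffs.shape[0]
    gaps = []
    for p in range(n_players):
        # Expected payoff when player p follows the recommendation.
        on_policy = np.sum(joint * payoffs[p])
        # Marginal over co-players' joint actions (sum out player p's axis).
        co_marginal = joint.sum(axis=p, keepdims=True)
        # Expected payoff of each fixed deviation a'_p against that marginal.
        other_axes = tuple(ax for ax in range(n_players) if ax != p)
        deviation = np.sum(co_marginal * payoffs[p], axis=other_axes)
        gaps.append(max(0.0, float(deviation.max() - on_policy)))
    return gaps


# Matching pennies: player 0 wants to match, player 1 wants to mismatch.
matching_pennies = np.array([
    [[1.0, -1.0], [-1.0, 1.0]],
    [[-1.0, 1.0], [1.0, -1.0]],
])
uniform = np.full((2, 2), 0.25)
both_play_first = np.array([[1.0, 0.0], [0.0, 0.0]])

uniform_gaps = cce_gap(matching_pennies, uniform)
pure_gaps = cce_gap(matching_pennies, both_play_first)
```

In this example the uniform joint distribution leaves neither player with a profitable deviation, while the point mass on the first joint action lets player 1 gain by switching actions. A distribution whose per-player gaps are all zero satisfies the CCE inequalities above.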

Figures (10)

Figure 1: A turn-based two-player zero-sum game where player 1 publicly chooses a direction that player 2 is rewarded for avoiding. Terminal nodes show the payoffs of player 2.
Figure 2: Efficient best-response solving by reusing transferable representation from the policy population $\Pi^\mathcal{V}_\theta$. At iteration $t$, the policy head $\Pi_\phi$ (green) reuses the encoder and memory representation from $\Pi_\theta$ (gray) to learn a best-response to co-player mixed-strategy $\sigma^{t-1}_{\neg p}$. The best-response policy is concurrently distilled into the neural population of policies $\Pi^\mathcal{V}_\theta(\cdot | s, \nu^t_p)$ under the strategy embedding vector $\nu^t_p$.
Figure 3: Exact CCE gaps and CCE values in 6 OpenSpiel games for NeuPL-JPSRO (Blue) compared to JPSRO (Red), using exact best-response and expected payoff solvers and averaged over 5 seeds.
Figure 4: Emergence of cooperation in MuJoCo multiagent cheetah_run. (Left) Expected returns achieved at each iteration (solid) compared to the maximum return obtained by independent trials where players optimise through self-play (dashed). The average return at iteration 16 is comparable to that of SoTA single-agent RL (Shahriari et al., 2022). (Middle) The sequence of CCE best-responded to at each iteration for player 1 (rear leg) and player 2 (front leg). (Right) Visualization of the learned behaviours at iteration 3, where the rear leg player raises the front leg player, and at iteration 16, where both players cooperate competently.
Figure 5: (Left) A 4-player capture-the-flag environment showing first-person views for each player. (Right) Convergence to a CCE shown by the diminishing incentive to deviate to an independent BR across iterations. (Blue) Expected returns of independent RL exploiter policies optimised against the marginal CCE mixed-strategy $\sigma^t_{\neg p}$ from NeuPL-JPSRO at iteration $t$. (Orange) Same as Blue, but initialised with pre-trained encoder and memory network parameters (as in Figure 2). (Red) CCE values at each NeuPL-JPSRO iteration. Solid lines show the optimistic best-of-six exploiter returns.
...and 5 more figures

Theorems & Definitions (4)

Definition 3.1: Full-Support
Theorem 3.2: CCE Convergence
Definition A.1: Unique Stochastic Policy Mapping
Lemma A.2: Finite Unique Stochastic Policies
