A Survey on Self-play Methods in Reinforcement Learning

Ruize Zhang; Zelai Xu; Chengdong Ma; Chao Yu; Wei-Wei Tu; Wenhao Tang; Shiyu Huang; Deheng Ye; Wenbo Ding; Yaodong Yang; Yu Wang

TL;DR

This survey provides a systematic roadmap of self-play in non-cooperative multi-agent reinforcement learning (MARL). It frames the problem with MARL and game-theoretic preliminaries and then unifies diverse algorithms under a single framework. It classifies self-play methods into four families: traditional self-play, PSRO, ongoing-training-based, and regret-minimization-based. For each family, it details how the method fits within the framework's components (the policy population $\Pi$, the interaction matrix $\Sigma$, the meta-strategy solver (MSS), and the oracle) and how its learning signals are generated. The empirical analysis spans Go, Stratego, Texas Hold’em, DouDiZhu, Mahjong, StarCraft II, MOBA games, and Google Research Football, illustrating where self-play achieves superhuman performance and where limitations persist. The paper also discusses open theoretical gaps, non-stationarity, scalability, and the potential for integrating large language models, with the aim of guiding future algorithm design and real-world applications.

Abstract

Self-play, a learning paradigm where agents iteratively refine their policies by interacting with historical or concurrent versions of themselves or other evolving agents, has shown remarkable success in solving complex non-cooperative multi-agent tasks. Despite its growing prominence in multi-agent reinforcement learning (MARL) applications such as Go, poker, and video games, a comprehensive and structured understanding of self-play remains lacking. This survey fills this gap by offering a comprehensive roadmap to the diverse landscape of self-play methods. We begin by introducing the necessary preliminaries, including the MARL framework and basic game theory concepts. We then provide a unified framework and classify existing self-play algorithms within it. Moreover, we bridge the gap between the algorithms and their practical implications by illustrating the role of self-play in different non-cooperative scenarios. Finally, we highlight open challenges and future research directions in self-play.

Paper Structure (57 sections, 4 theorems, 10 equations, 4 figures, 2 tables, 6 algorithms)

Introduction
Preliminaries
MARL Framework
Game Theory Concepts
Normal-Form and Extensive-Form
Transitive Game and Non-Transitive Game
Stage Game and Repeated Game
Nash Equilibrium
Static Game and Dynamic Game
Algorithms
Framework Definition
Traditional Self-Play Algorithms
Integration into Our Framework
Vanilla Self-Play
Fictitious Play
...and 42 more sections

Key Result

Corollary 1

In traditional self-play algorithms, the interaction matrix $\Sigma$ is a lower triangular matrix.
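The corollary can be checked on small examples. The sketch below is a minimal illustration, not the paper's construction: it assumes the common convention that, in vanilla self-play, policy $m$ trains only against its immediate predecessor, and that, in fictitious self-play, policy $m$ trains against a uniform mixture of all earlier policies. The function names are hypothetical, and indices are 0-based.

```python
def vanilla_self_play(k):
    """Assumed convention: row m puts all mass on policy m-1."""
    sigma = [[0.0] * k for _ in range(k)]
    for m in range(1, k):
        sigma[m][m - 1] = 1.0
    return sigma


def fictitious_self_play(k):
    """Assumed convention: row m is uniform over policies 0..m-1."""
    sigma = [[0.0] * k for _ in range(k)]
    for m in range(1, k):
        for n in range(m):
            sigma[m][n] = 1.0 / m
    return sigma


def is_lower_triangular(sigma):
    """True if every entry above the main diagonal is zero."""
    size = len(sigma)
    return all(
        sigma[m][n] == 0.0
        for m in range(size)
        for n in range(m + 1, size)
    )


if __name__ == "__main__":
    for build in (vanilla_self_play, fictitious_self_play):
        sigma = build(3)
        print(build.__name__, sigma, is_lower_triangular(sigma))
```

Under these assumptions, both matrices are lower triangular because a policy is only ever optimized against policies that already exist when its training starts.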

Figures (4)

Figure 1: Overview of our survey.
Figure 2: The example of the market entry game. (a) Matrix representation of simultaneous market entry game in normal-form. (b) Game tree representation of sequential market entry game in extensive-form.
Figure 3: Illustration of our framework. The top row shows two different initialization methods for the policy population $\Pi$ and the interaction matrix $\Sigma$: lazy initialization and immediate initialization (corresponding to Line \ref{framework_line:initialize_1} in Algo. \ref{alg:framework}). The bottom row illustrates two training paradigms: standard training and ongoing training (corresponding to Lines \ref{framework_line:epoch}--\ref{framework_line:end_for} in Algo. \ref{alg:framework}). We categorize self-play algorithms into four types: traditional self-play, the PSRO series, the ongoing-training-based series, and the regret-minimization-based series. Among these, traditional self-play, the PSRO series, and the regret-minimization-based series use lazy initialization and standard training, whereas the ongoing-training-based series uses immediate initialization and ongoing training. For a detailed description and analysis, please refer to Sec. \ref{sec:algorithms} and Table \ref{table:algs}.
Figure 4: Interaction matrix $\Sigma$ examples. When $C_1=K$ and the opponent sampling strategy $\Sigma_{mn}$ represents the probability that policy $m$ is optimized against policy $n$, the interaction matrix $\Sigma$ can be depicted as a directed interaction graph. Here, we consider three representative self-play algorithms. In the top section, we define the interaction matrix $\Sigma\in \mathbb{R}^{3\times 3}$ as $\{\sigma_{[k]}\}_{k=1}^3$. In the bottom section, we present directed interaction graphs in which the outgoing edges from each node are equally weighted and their weights sum to one. The two sections are related through the directed edges: an edge from node $m$ to node $n$ with weight $\Sigma_{mn}$ signifies that policy $m$ is optimized against policy $n$ with probability $\Sigma_{mn}$. Note that this figure is reproduced from \cite{liu2022neupl}, and the concept was initially proposed by \cite{garnelo2021pick}.
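The matrix-to-graph mapping described in the Figure 4 caption can be sketched in a few lines. This is an illustrative sketch only: the function name `interaction_edges` and the example matrix are assumptions (a hypothetical $3\times 3$ matrix with uniformly weighted rows, not values read from the figure), and indices are 0-based.

```python
def interaction_edges(sigma, tol=1e-12):
    """Return (m, n, weight) for every nonzero entry Sigma[m][n].

    An edge m -> n with weight w means policy m is optimized
    against policy n with probability w.
    """
    edges = []
    for m, row in enumerate(sigma):
        for n, weight in enumerate(row):
            if weight > tol:
                edges.append((m, n, weight))
    return edges


def outgoing_weight(edges, node):
    """Total weight on the edges leaving `node`."""
    return sum(w for m, _, w in edges if m == node)


# Hypothetical example: each policy trains uniformly against its predecessors.
sigma_example = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]

if __name__ == "__main__":
    edges = interaction_edges(sigma_example)
    for m, n, w in edges:
        print(f"policy {m} -> policy {n} (probability {w})")
```

For any node that has outgoing edges, `outgoing_weight` should return one, matching the caption's statement that each node's outgoing weights sum to one.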

Theorems & Definitions (4)

Corollary 1
Corollary 2
Corollary 3
Corollary 4
