SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Bo Liu; Leon Guertler; Simon Yu; Zichen Liu; Penghui Qi; Daniel Balcells; Mickel Liu; Cheston Tan; Weiyan Shi; Min Lin; Wee Sun Lee; Natasha Jaques

TL;DR

SPIRAL is a fully online, multi-turn, multi-agent reinforcement learning framework that trains LLMs to reason by playing zero-sum games against evolving opponents. It uses a single shared policy conditioned on player roles and introduces Role-conditioned Advantage Estimation (RAE), which stabilizes training and prevents thinking collapse, enabling continuous improvement. Empirically, self-play on Kuhn Poker alone improves math and general reasoning benchmarks, and the reasoning patterns it develops (Case-by-Case Analysis, Expected Value Calculation, Pattern Recognition) carry over across domains; multi-game training yields further synergistic benefits and transfers to unseen games. The results suggest that zero-sum games can act as scalable reasoning environments, complementing or surpassing domain-specific supervised data and fixed-opponent MARL baselines, and that the gains extend even to strong pre-existing reasoning models. This work points toward autonomous, self-improving reasoning systems driven by adversarial, multi-turn curricula.

Abstract

Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems, as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves an 8.6% improvement on math and an 8.4% improvement on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance, as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) still yields a 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
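The abstract names role-conditioned advantage estimation (RAE) but does not give its formula. As a rough illustration only, the sketch below assumes that RAE amounts to subtracting a separate running baseline for each (game, role) pair from that role's episode return, so that a structural advantage of one seat (for example, moving first) does not leak into the policy gradient. The class name `RoleConditionedBaseline`, the exponential-moving-average update, and the `decay` parameter are all hypothetical choices made for this sketch, not details taken from the text above.

```python
from collections import defaultdict


class RoleConditionedBaseline:
    """Hypothetical sketch: one running baseline per (game, role) pair."""

    def __init__(self, decay: float = 0.95):
        # decay is an assumed smoothing factor for the moving average
        self.decay = decay
        # baselines start at 0.0 for every unseen (game, role) key
        self.baselines = defaultdict(float)

    def advantage(self, game: str, role: int, episode_return: float) -> float:
        """Update the baseline for this (game, role), then return R - baseline."""
        key = (game, role)
        self.baselines[key] = (
            self.decay * self.baselines[key]
            + (1.0 - self.decay) * episode_return
        )
        return episode_return - self.baselines[key]


# Zero-sum episode: player 0 wins (+1), player 1 loses (-1).
rae = RoleConditionedBaseline(decay=0.5)
adv_p0 = rae.advantage("kuhn_poker", 0, 1.0)
adv_p1 = rae.advantage("kuhn_poker", 1, -1.0)
```

Because each role keeps its own baseline, a role that wins most of the time sees its advantage shrink toward zero for ordinary wins, while the other role is not penalized for losses that merely reflect its seat.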
