SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques

TL;DR

SPIRAL presents a fully online, multi-turn, multi-agent reinforcement learning framework that trains LLMs to reason by playing zero-sum games against evolving opponents. By using a single shared policy conditioned on player roles and introducing Role-conditioned Advantage Estimation, SPIRAL stabilizes training and prevents thinking collapse, enabling continuous improvement. Empirically, self-play on Kuhn Poker transfers to math and general reasoning benchmarks, with notable gains and patterns (Case-by-Case Analysis, Expected Value Calculation, Pattern Recognition) that transfer across domains; multi-game training further yields synergistic benefits and transfers to unseen games. The results imply zero-sum games can act as scalable reasoning environments, complementing or surpassing domain-specific supervised data and fixed-opponent MARL baselines, and they extend even to strong pre-existing reasoning models. This work suggests a path toward autonomous, self-improving reasoning systems driven by adversarial, multi-turn curricula.

Abstract

Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves an 8.6% improvement on math and an 8.4% improvement on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) still yields a 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.

Paper Structure

This paper contains 44 sections, 10 equations, 8 figures, 9 tables, 1 algorithm.

Figures (8)

  • Figure 1: SPIRAL achieves consistent improvements over base models across game performance and reasoning benchmarks. It also surpasses SFT on expert game trajectories and RL baselines trained against fixed opponents (Mistral and Gemini).
  • Figure 2: From human-designed rewards to self-discovered reasoning through SPIRAL. Left: Traditional RL requires human experts to design complex reward functions. Middle: Fixed opponent training leads to exploitation of static strategies. Right: SPIRAL enables continuous reasoning improvement through self-play, where both players develop increasingly sophisticated strategies without human supervision.
  • Figure 3: The SPIRAL Framework. SPIRAL employs an actor-learner architecture for scalable self-play training. Parallel actors sample trajectories from a diverse set of games using vectorized environments. A single policy $\pi_i$ plays both roles, generating zero-sum, sparse reward game trajectories. The centralized learner processes these trajectories using Role-conditioned Advantage Estimation (RAE) to compute separate advantages, $A_0(s,a)$ and $A_1(s,a)$, for each role. These are then used for on-policy reinforcement learning updates.
  • Figure 4: Evolution of reasoning patterns during SPIRAL training and their transfer to mathematical reasoning. We track three core reasoning patterns (Pattern Recognition, Expected Value Calculation, and Case-by-Case Analysis) across 290 game trajectories and 46,792 math solutions. Left: In the game domain, all patterns show substantial growth, with Expected Value Calculation reaching 78% by late training. Middle: These patterns transfer to mathematical reasoning with varying effectiveness: Case-by-Case Analysis maintains high transfer (72% to 71%), Pattern Recognition shows amplification (35% to 45%), while Expected Value Calculation transfers more selectively (78% to 28%). Right: Math benchmark scores improve from 31.2 to 39.6 as these reasoning patterns develop, demonstrating that game-learned strategies enhance mathematical problem-solving capabilities.
  • Figure 5: Performance comparison of self-play training and fixed-opponent baselines. All evaluations are averaged over multiple games/benchmarks (see the section on evaluation metrics). Mistral Opponent denotes training against Mistral-Small-3; Gemini Opponent denotes training against Gemini-2.0-Flash-Lite.
  • ...and 3 more figures
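
The Figure 3 caption says that RAE computes separate advantages, $A_0(s,a)$ and $A_1(s,a)$, one per role. The sketch below illustrates one way that could work. It is not the authors' implementation. It assumes that each (game, role) pair keeps its own baseline, that the baseline is an exponential moving average of that role's returns, and that the advantage is the return minus the freshly updated baseline. The names `Trajectory`, `RoleConditionedBaseline`, and `decay` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One finished self-play game (hypothetical container)."""
    game: str       # e.g. "KuhnPoker"
    returns: tuple  # (R_0, R_1): final return for role 0 and role 1


class RoleConditionedBaseline:
    """Keeps one running baseline per (game, role) pair.

    Assumption: the baseline is an exponential moving average (EMA) of
    observed returns, and it is updated before the advantage is computed.
    """

    def __init__(self, decay=0.95):
        self.decay = decay
        # Baselines start at 0.0 for every unseen (game, role) key.
        self.baselines = defaultdict(float)

    def advantages(self, traj):
        """Return (A_0, A_1), one trajectory-level advantage per role."""
        result = []
        for role, ret in enumerate(traj.returns):
            key = (traj.game, role)
            # EMA update of this role's baseline with the new return.
            self.baselines[key] = (
                self.decay * self.baselines[key] + (1.0 - self.decay) * ret
            )
            # Advantage: how much better this return is than the baseline.
            result.append(ret - self.baselines[key])
        return tuple(result)


if __name__ == "__main__":
    rae = RoleConditionedBaseline(decay=0.5)
    # Zero-sum outcome: role 0 wins (+1), role 1 loses (-1).
    game = Trajectory(game="KuhnPoker", returns=(1.0, -1.0))
    print(rae.advantages(game))
    print(rae.advantages(game))
```

A per-role baseline is useful when a game favors one seat, for example the first mover. A single shared baseline would then push one role's advantages systematically positive and the other's systematically negative. Separate baselines center each role's returns around its own average.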