Towards Understanding Self-play for LLM Reasoning

Justin Yang Chae, Md Tanvirul Alam, Nidhi Rastogi

TL;DR

This work analyzes self-play for LLM reasoning via the Absolute Zero Reasoner (AZR), comparing it against RLVR and SFT to uncover training dynamics that accompany reasoning gains. By examining pass@k performance, entropy dynamics, and parameter update sparsity across two AZR sizes, the study shows that self-play improves reasoning at low $k$ but remains bounded by the base model's capacity at high $k$, with gains likely arising from distributional sharpening and co-evolutionary data diversity. The results highlight the proposer as the crucial component, revealing that AZR generates increasingly difficult curricula and adjusts response length with problem difficulty, while exhibiting entropy collapse and intermediate update sparsity. The paper suggests future directions including expanding self-play frameworks, automatic curriculum design, and entropy-regulation strategies to push LLM math reasoning beyond current limits.

Abstract

Recent advances in large language model (LLM) reasoning, led by reinforcement learning with verifiable rewards (RLVR), have inspired self-play post-training, where models improve by generating and solving their own problems. While self-play has shown strong in-domain and out-of-domain gains, the mechanisms behind these improvements remain poorly understood. In this work, we analyze the training dynamics of self-play through the lens of the Absolute Zero Reasoner, comparing it against RLVR and supervised fine-tuning (SFT). Our study examines parameter update sparsity, entropy dynamics of token distributions, and alternative proposer reward functions. We further connect these dynamics to reasoning performance using pass@k evaluations. Together, our findings clarify how self-play differs from other post-training strategies, highlight its inherent limitations, and point toward future directions for improving LLM math reasoning through self-play.
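
The pass@k evaluations mentioned above can be illustrated with a minimal sketch. It assumes the widely used unbiased estimator, which draws n samples per problem, counts the c correct ones, and estimates the probability that at least one of k samples is correct; whether the paper uses exactly this estimator is an assumption, and the function names and per-problem counts below are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples drawn, c: number of correct samples, k: sampling budget.
    Computes 1 - C(n - c, k) / C(n, k), the chance that a random size-k subset
    of the n samples contains at least one correct sample.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given the correct count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts out of n = 100 samples on three problems.
base_counts = [2, 1, 1]     # base model: rarely correct, but never at zero
tuned_counts = [60, 40, 0]  # tuned model: sharper, but one problem is lost

for k in (1, 100):
    print(k, mean_pass_at_k(base_counts, 100, k), mean_pass_at_k(tuned_counts, 100, k))
```

With these made-up counts, the tuned model scores higher at k = 1 while the base model scores higher at k = 100, which is the shape of result the TL;DR describes as improvement at low $k$ bounded by the base model at high $k$.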

Paper Structure

This paper contains 24 sections, 2 theorems, 10 equations, 7 figures.

Key Result

Theorem 1

Initialize with $\pi^{\mathrm P}_0=\pi^{\mathrm S}_0=q_0$. Under AZR's on-policy updates with $\gamma_t\equiv 0$ and no off-policy data,

Figures (7)

  • Figure 1: Self-play models are still bounded by the base model. (Top row) Pass@k curves for AZR-Coder-3B (red) and Qwen2.5-Coder-3B (blue). (Bottom row) Pass@k curves for AZR-Coder-7B (red) and Qwen2.5-Coder-7B (blue).
  • Figure 2: AZR-Coder-7B adapts response length to question difficulty, while Qwen2.5-Coder-7B does so to a lesser extent. (Left) Average response length at every 25th iteration. (Right) Average solve rate at every 25th iteration.
  • Figure 3: Policy entropy decays at different rates based on model size and setup. Policy entropy curves for AZR-Coder-3B, AZR-Coder-3B with a frozen proposer, and AZR-Coder-7B.
  • Figure 4: Proposer entropy stays higher than solver entropy. Proposer, solver, and policy entropy curves for AZR-Coder-3B (left) and AZR-Coder-7B (right).
  • Figure 5: Self-play has distinct update sparsity compared to RL-tuned and SFT models. Update sparsity comparison between public checkpoints of fine-tuned models and their corresponding base models: (left) Qwen2.5-Coder-3B and (right) Qwen2.5-Coder-7B.
  • ...and 2 more figures
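
The update sparsity comparison in Figure 5 can be sketched as follows. This is a minimal illustration that assumes update sparsity means the fraction of parameters left effectively unchanged between a base checkpoint and its fine-tuned counterpart; the tolerance value, the function name, and the toy weights are assumptions for illustration, not the paper's stated procedure.

```python
def update_sparsity(base: list[float], tuned: list[float], tol: float = 1e-5) -> float:
    """Fraction of parameters whose value changed by at most `tol`.

    `base` and `tuned` are flattened parameter vectors of equal length.
    A value near 1.0 means fine-tuning touched few parameters; a value
    near 0.0 means almost every parameter moved.
    """
    if len(base) != len(tuned):
        raise ValueError("checkpoints must have the same number of parameters")
    if not base:
        raise ValueError("checkpoints must not be empty")
    unchanged = sum(1 for b, t in zip(base, tuned) if abs(t - b) <= tol)
    return unchanged / len(base)


# Toy example (hypothetical weights): only the second parameter moved.
base_weights = [0.1, 0.2, 0.3, 0.4]
tuned_weights = [0.1, 0.25, 0.3, 0.4]
print(update_sparsity(base_weights, tuned_weights))
```

For real checkpoints the same count would be taken tensor by tensor and aggregated over the whole model; the tolerance matters because low-precision weight formats can hide very small updates.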

Theorems & Definitions (2)

  • Theorem 1: Role-wise support preservation for AZR
  • Corollary 1: Reasoning boundary / zero-probability barrier