Towards Understanding Self-play for LLM Reasoning
Justin Yang Chae, Md Tanvirul Alam, Nidhi Rastogi
TL;DR
This work analyzes self-play for LLM reasoning via the Absolute Zero Reasoner (AZR), comparing it against RLVR and SFT to uncover training dynamics that accompany reasoning gains. By examining pass@k performance, entropy dynamics, and parameter update sparsity across two AZR sizes, the study shows that self-play improves reasoning at low $k$ but remains bounded by the base model's capacity at high $k$, with gains likely arising from distributional sharpening and co-evolutionary data diversity. The results highlight the proposer as the crucial component, revealing that AZR generates increasingly difficult curricula and adjusts response length with problem difficulty, while exhibiting entropy collapse and intermediate update sparsity. The paper suggests future directions including expanding self-play frameworks, automatic curriculum design, and entropy-regulation strategies to push LLM math reasoning beyond current limits.
Abstract
Recent advances in large language model (LLM) reasoning, led by reinforcement learning with verifiable rewards (RLVR), have inspired self-play post-training, where models improve by generating and solving their own problems. While self-play has shown strong in-domain and out-of-domain gains, the mechanisms behind these improvements remain poorly understood. In this work, we analyze the training dynamics of self-play through the lens of the Absolute Zero Reasoner, comparing it against RLVR and supervised fine-tuning (SFT). Our study examines parameter update sparsity, entropy dynamics of token distributions, and alternative proposer reward functions. We further connect these dynamics to reasoning performance using pass@k evaluations. Together, our findings clarify how self-play differs from other post-training strategies, highlight its inherent limitations, and point toward future directions for improving LLM math reasoning through self-play.
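The three quantities named above (pass@k, token-distribution entropy, and parameter update sparsity) can be illustrated with a minimal Python sketch. It is not the paper's evaluation code. It assumes the commonly used unbiased pass@k estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples with $c$ correct, Shannon entropy in nats, and sparsity defined as the fraction of parameters whose value is unchanged within a tolerance. The function names and the `tol` parameter are hypothetical, introduced only for illustration.

```python
import math
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Assumed form: 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def update_sparsity(base: Sequence[float], tuned: Sequence[float],
                    tol: float = 0.0) -> float:
    """Fraction of parameters left unchanged (within tol) by post-training."""
    if len(base) != len(tuned) or len(base) == 0:
        raise ValueError("parameter vectors must be non-empty and equal length")
    unchanged = sum(1 for b, t in zip(base, tuned) if abs(b - t) <= tol)
    return unchanged / len(base)


if __name__ == "__main__":
    # Toy numbers, purely illustrative (not results from the paper).
    print(pass_at_k(n=10, c=3, k=1))
    print(token_entropy([0.25, 0.25, 0.25, 0.25]))
    print(update_sparsity([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.5, 4.0]))
```

Under these definitions, a sharper output distribution shows up as lower `token_entropy`, and a model can raise pass@k at small $k$ without raising it at large $k$ when it concentrates probability on solutions the base model could already reach.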
