
SPICE: Self-Play In Corpus Environments Improves Reasoning

Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, Jason Weston

TL;DR

SPICE introduces a corpus-grounded self-play framework in which a single model alternates as Challenger and Reasoner to generate and solve document-grounded tasks. The external document corpus provides verifiable signals, preventing hallucination and enabling an automatic curriculum that challenges the Reasoner at the frontier of its capability. Empirical results show consistent improvements on mathematical and general reasoning benchmarks across multiple base models, outperforming ungrounded self-play baselines. The approach represents a shift toward sustained, environment-driven self-improvement for large language models with broad transfer across domains.

Abstract

Self-improving systems require environmental interaction for continuous adaptation. We introduce SPICE (Self-Play In Corpus Environments), a reinforcement learning framework where a single model acts in two roles: a Challenger that mines documents from a large corpus to generate diverse reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics, the Challenger creates an automatic curriculum at the frontier of the Reasoner's capability, while corpus grounding provides the rich, near-inexhaustible external signal necessary for sustained improvement. Unlike existing ungrounded self-play methods that offer more limited benefits, SPICE achieves consistent gains across mathematical (+8.9%) and general reasoning (+9.8%) benchmarks on multiple model families. Our analysis reveals how document grounding is a key ingredient in SPICE to continuously generate its own increasingly challenging goals and achieve them, enabling sustained self-improvement.

Paper Structure

This paper contains 23 sections, 5 equations, 10 figures, 8 tables, and 1 algorithm.

Figures (10)

  • Figure 1: SPICE (Self-Play In Corpus Environments) outperforms state-of-the-art self-play methods for LLMs on Qwen3-4B-Base (right). Training the model in SPICE to act as both Challenger and Reasoner, creating and solving challenging corpus-grounded tasks for itself via self-play RL, is crucial to this success (left); see ablations for details.
  • Figure 2: SPICE is a self-play framework where a single LLM, $\pi_\theta$, acts in two roles: a Challenger ($\text{role}=C$), which poses difficult questions, and a Reasoner ($\text{role}=R$), which tries to answer them correctly. The Challenger uses a raw document (which does not contain existing questions or labels) from a corpus to generate a ($q$, $a^*$) pair. Depending on the contents of the document, the Challenger can generate either a closed-form multiple-choice question or a free-form question with different answer types (integer, float, expression). The Challenger is rewarded for questions that elicit high variance in the Reasoner's correctness, and the Reasoner is rewarded for answering questions correctly.
  • Figure 3: Reasoner pass rates when evaluating SPICE checkpoints at steps 200-640 against a fixed step-200 checkpoint on 128 documents. (a) Fixed Reasoner: Pass rate decreases from 55% to 35% as later Challenger checkpoints generate harder questions. (b) Fixed Challenger: Pass rate increases from 55% to 85% as later Reasoner checkpoints improve at solving questions.
  • Figure 4: Training dynamics comparing key components of SPICE. (a) Challenger Ablation: Without Challenger training (fixed Challenger), performance improves less than with full adversarial training. (b) Corpus Ablation: With corpus grounding, the model achieves steady improvement to 43.9%, while without it performance plateaus lower at 40.7%, showing the importance of external grounding.
  • Figure 5: Evolution of the Challenger's task complexity on the same document. The Challenger progresses from extracting explicit facts to generating questions requiring understanding of angular size relationships and proportional reasoning to arrive at document-stated values.
  • ...and 5 more figures
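
The Figure 2 caption states that the Challenger is rewarded for questions with high variance in the Reasoner's correctness. A minimal sketch of one natural reading of that signal, assuming binary correctness outcomes from several sampled Reasoner attempts per question (the function name and exact formula here are illustrative, not taken from the paper):

```python
import statistics

def challenger_reward(correct_flags):
    """Population variance of binary Reasoner outcomes on one question.

    `correct_flags` holds 0/1 correctness results from several sampled
    Reasoner attempts. Variance is maximized (0.25) when about half the
    attempts succeed, so questions at the frontier of the Reasoner's
    ability score highest; trivial (all 1s) or impossible (all 0s)
    questions earn zero reward.
    """
    return statistics.pvariance(correct_flags)

# All-correct and all-wrong questions give the Challenger nothing;
# a 50/50 split gives the maximum reward of 0.25.
print(challenger_reward([1, 1, 1, 1]))  # 0
print(challenger_reward([1, 0, 1, 0]))  # 0.25
```

Under this reading, the variance term is what drives the automatic curriculum: as the Reasoner improves, previously hard questions become uniformly solvable and stop paying, pushing the Challenger toward harder ones.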