Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

Jacob Dineen, Aswin RRV, Zhikun Xu, Ben Zhou

Abstract

Co-evolutionary self-play, where one language model generates problems and another solves them, promises autonomous curriculum learning without human supervision. In practice, the proposer quickly converges to a narrow distribution of problems that satisfy the reward function. This diversity collapse renders the curriculum uninformative for the solver, stalling the co-evolutionary loop. We introduce vocabulary dropout, a random mask applied to the proposer's output logits during both policy training and curriculum generation, as a lightweight mechanism to sustain diversity. The mask is hard and non-stationary, preventing the proposer from locking into fixed token sequences. Training Qwen3-4B and Qwen3-8B on mathematical reasoning via R-Zero, we find that vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics throughout training, and yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks. Our findings suggest that explicit action-space constraints, analogous to the structural role that game rules play in classical self-play, can help sustain productive co-evolution in language. Vocabulary dropout is one simple instantiation of this principle.
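As a concrete illustration of the mechanism sketched above, a hard, non-stationary vocabulary mask over the proposer's output logits could look like the following minimal sketch. The function name, tensor shapes, and default keep probability (0.75, the value used in later figures) are illustrative assumptions, not the authors' implementation.

```python
import torch

def vocabulary_dropout(logits: torch.Tensor, keep_prob: float = 0.75) -> torch.Tensor:
    """Hard vocabulary dropout on proposer output logits (illustrative sketch).

    logits: (batch, seq_len, vocab_size) next-token logits.
    keep_prob: probability that each vocabulary token survives the mask.
    A fresh Bernoulli mask is drawn on every call, so the surviving
    vocabulary changes from batch to batch (the mask is non-stationary).
    """
    vocab_size = logits.size(-1)
    keep = torch.bernoulli(
        torch.full((vocab_size,), keep_prob, device=logits.device)
    ).bool()
    # Hard mask: dropped tokens get -inf logits, i.e. zero probability
    # after softmax, so they can never be generated.
    return logits.masked_fill(~keep, float("-inf"))
```

Because a fresh mask is drawn each time, the proposer cannot lock into a single high-reward token sequence: whatever problems it proposes must remain expressible under the vocabulary that happens to survive each draw.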

Paper Structure

This paper contains 53 sections, 4 equations, 7 figures, 11 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Training pipeline. Left: Vocabulary dropout masks a random subset of output logits, constraining the proposer's token distribution. Right: The co-evolution loop. In Phase 1 (proposer training), the proposer generates $K$ problems, the frozen solver attempts each $M$ times, and the proposer is rewarded based on solver uncertainty. In Phase 2 (solver training), the frozen proposer generates a curriculum of $K$ problems, the solver attempts each $M$ times, and the solver is rewarded for matching the correct answer.
  • Figure 2: Question profile at iteration 5 (% change from baseline) for VD85 and VD75.
  • Figure 3: Diversity and curriculum quality over co-evolution iterations ($\alpha{=}0.75$, both phases). Top: collapse signals. Bottom: diversity metrics. Dashed = baseline, solid = dropout. Semantic metrics (b, d, e) use text-embedding-3-small (OpenAI, 2024).
  • Figure 4: Qwen3-8B solver accuracy across iterations under fixed vs. annealed ($0.75 {\to} 1.0$) vocabulary dropout. Green $\alpha$ values below each tick show the anneal schedule. (a) Mean of MATH500, GSM8K, OlympiadBench, and Minerva Math.
  • Figure 5: Vocabulary dropout as a unified diff. The only change is sampling a Bernoulli mask and passing the surviving token IDs to vLLM's allowed_token_ids before generation. The mask is resampled every batch (a minimal sketch of this change appears after the figure list).
  • ...and 2 more figures
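Figure 5 describes the implementation as a small change around generation: sample a Bernoulli mask over the vocabulary and pass the surviving token IDs to vLLM's allowed_token_ids, resampling every batch. The sketch below illustrates that idea. It assumes a vLLM version whose SamplingParams accepts allowed_token_ids; the helper name, model id (Qwen/Qwen3-8B), sampling settings, and prompt are placeholders rather than the paper's actual configuration.

```python
import random
from vllm import LLM, SamplingParams

def dropout_sampling_params(vocab_size: int, keep_prob: float = 0.75,
                            **sampling_kwargs) -> SamplingParams:
    """Sample a fresh Bernoulli vocabulary mask and restrict generation
    to the surviving token IDs via vLLM's allowed_token_ids."""
    allowed = [tid for tid in range(vocab_size) if random.random() < keep_prob]
    return SamplingParams(allowed_token_ids=allowed, **sampling_kwargs)

llm = LLM(model="Qwen/Qwen3-8B")
vocab_size = llm.get_tokenizer().vocab_size

# Resample the mask before every proposer batch (non-stationary mask).
params = dropout_sampling_params(vocab_size, keep_prob=0.75,
                                 temperature=1.0, max_tokens=1024)
outputs = llm.generate(["Propose a challenging math problem."], params)
```

In practice one would presumably exclude special tokens such as EOS from the droppable set so that generation can still terminate normally.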