Guided Self-Evolving LLMs with Minimal Human Supervision

Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang, Haitao Mi, Dong Yu

TL;DR

The paper tackles drift and diversity collapse in unguided self-evolving LLMs by introducing R-Few, a guided self-play framework that uses a few-shot grounded Challenger and an online curriculum-based Solver. By sampling a small set of human anchors and continuously ranking mid-difficulty problems, R-Few achieves stable, iterative improvements on both mathematical and general reasoning benchmarks with far less human data than fully supervised systems. Ablation and analysis show that grounding and curriculum learning are key to mitigating drift and maintaining productive co-evolution. The results demonstrate substantial data efficiency and suggest practical pathways to scalable, controllable self-improvement for large language models.

Abstract

AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet in practice, unguided self-evolving systems often plateau quickly or even degrade as training progresses. These failures arise from issues such as concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. To enable models to self-evolve in a stable and controllable manner while minimizing reliance on human supervision, we introduce R-Few, a guided Self-Play Challenger-Solver framework that incorporates lightweight human oversight through in-context grounding and mixed training. At each iteration, the Challenger samples a small set of human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum. Across math and general reasoning benchmarks, R-Few achieves consistent and iterative improvements. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and achieves performance on par with General-Reasoner, despite the latter being trained on 20 times more human data. Ablation studies confirm the complementary contributions of grounded challenger training and curriculum-based solver training, and further analysis shows that R-Few mitigates drift, yielding more stable and controllable co-evolutionary dynamics.
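The difficulty-based curriculum mentioned above can be illustrated with a minimal sketch. It assumes that each question already has an estimated Solver success rate (for example, the fraction of sampled answers that were correct) and keeps only questions in a middle band. The function name `select_mid_difficulty`, the band `[0.3, 0.7]`, and the ordering by closeness to 0.5 are hypothetical choices for illustration; the abstract does not specify the paper's actual selection rule.

```python
from typing import Dict, List


def select_mid_difficulty(
    success_rates: Dict[str, float],
    low: float = 0.3,   # assumed lower bound, not taken from the paper
    high: float = 0.7,  # assumed upper bound, not taken from the paper
) -> List[str]:
    """Return ids of questions whose success rate lies in [low, high].

    Questions are ordered by how close their success rate is to 0.5,
    so the most uncertain questions come first.
    """
    kept = [
        (abs(rate - 0.5), qid)
        for qid, rate in success_rates.items()
        if low <= rate <= high
    ]
    return [qid for _, qid in sorted(kept)]


if __name__ == "__main__":
    # Hypothetical success rates for a mix of human and synthetic questions.
    rates = {"q1": 0.0, "q2": 0.5, "q3": 0.9, "q4": 0.375}
    print(select_mid_difficulty(rates))
```

In this toy run, questions that are always failed (`q1`) or almost always solved (`q3`) are dropped, which reflects the general idea of training on problems at the edge of the Solver's current ability.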

Paper Structure

This paper contains 30 sections, 13 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: R-Few delays the performance plateau seen in R-Zero and achieves higher performance. After training, it outperforms baselines on multiple benchmarks, showing more stable self-evolution.
  • Figure 2: Overview of the R-Few framework. The Challenger is incentivized to generate moderately ("medium") uncertain questions that lie at the edge of the Solver's current abilities; the Solver is rewarded for solving increasingly challenging tasks, sourced from both humans and the Challenger, via curriculum-based selection. The math examples shown in the figure are not real data; they are included for demonstration only.
  • Figure 3: Impact of domain-sampled human data on performance across MMLU-Pro categories.
  • Figure 4: Training curves of synthetic question diversity (measured by 2-gram lexical diversity), length (measured by word count), and difficulty (evaluated by Qwen3-8B-Base, with ground-truth labeled by Gemini-2.5-Pro) over training. R-Zero collapses in diversity and exhibits length inflation via verbosity, whereas R-Few maintains stable length and diversity during self-evolution.
  • Figure 5: Training curve of the solver (Qwen3-8B-Base), trained for 100 steps while alternating with the challenger.
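
The 2-gram lexical diversity named in the Figure 4 caption is not defined in this summary. One common way to compute such a score is the ratio of unique bigrams to total bigrams (often called distinct-2). The sketch below assumes that definition, with lowercasing and whitespace tokenization as further assumptions; the paper's exact formula may differ, and `distinct_2` is a hypothetical helper name.

```python
def distinct_2(text: str) -> float:
    """Ratio of unique word bigrams to total word bigrams in `text`.

    Assumes lowercased whitespace tokenization. Returns 0.0 when the
    text has fewer than two tokens and therefore no bigrams.
    """
    tokens = text.lower().split()
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return len(set(bigrams)) / len(bigrams)


if __name__ == "__main__":
    # A repetitive string scores lower than one with no repeated bigrams.
    print(distinct_2("a b a b a"))
    print(distinct_2("one two three"))
```

Under this definition, a falling score over training would indicate that generated questions are reusing the same word pairs more often, which is the kind of collapse the caption attributes to R-Zero.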