SAGE: Multi-Agent Self-Evolution for LLM Reasoning

Yulin Peng, Xinxin Zhu, Chenxing Wei, Nianbo Zeng, Leilei Wang, Ying Tiffany He, F. Richard Yu

Abstract

Reinforcement learning with verifiable rewards improves reasoning in large language models (LLMs), but many methods still rely on large human-labeled datasets. While self-play reduces this dependency, it often lacks explicit planning and strong quality control, which limits stability in long-horizon, multi-step reasoning. We present SAGE (Self-evolving Agents for Generalized reasoning Evolution), a closed-loop framework in which four agents (Challenger, Planner, Solver, and Critic) co-evolve from a shared LLM backbone using only a small seed set. The Challenger continuously generates increasingly difficult tasks; the Planner converts each task into a structured multi-step plan; and the Solver follows the plan to produce an answer, whose correctness is determined by external verifiers. The Critic scores and filters both generated questions and plans to prevent curriculum drift and maintain training signal quality, enabling stable self-training. Across mathematics and code-generation benchmarks, SAGE delivers consistent gains across model scales, improving the Qwen-2.5-7B model by 8.9% on LiveCodeBench and 10.7% on OlympiadBench.

Paper Structure

This paper contains 19 sections, 13 equations, 10 figures, 5 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Overview of the SAGE framework. Four specialized agents—Challenger, Planner, Solver, and Critic—interact through quality filtering and format validation to enable closed-loop self-evolution.
  • Figure 2: The SAGE training pipeline. (1) The Challenger generates questions from reference examples, filtered by the Critic for quality; (2) verified questions expand the dataset; (3) sampled questions are processed by the Planner and Solver to produce solutions; (4) all agents are jointly updated using Task-Relative REINFORCE++ with per-role advantage normalization.
  • Figure 3: Training dynamics on Qwen-2.5-3B. The Challenger steadily expands the question pool (bars) throughout training, while validation accuracy (line) peaks around steps 100 to 120 and then gradually declines, suggesting potential over-specialization on the self-generated curriculum.
  • Figure 4: Qualitative case study. The Challenger generates a math word problem, the Planner decomposes it into structured steps, the Solver executes the plan to produce the final answer, and the Critic provides quality scores for both the question and the plan.
  • Figure 5: The prompt of the Challenger Agent.
  • ...and 5 more figures
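
The Figure 2 caption mentions per-role advantage normalization in the joint update. As a rough illustration of that idea only, the sketch below standardizes each reward against the mean and standard deviation of rewards from the same role, so that roles with different reward scales contribute comparable learning signals. The function name, the `(role, reward)` input layout, the `eps` constant, and the sample rewards are all assumptions made for this example; the paper's actual Task-Relative REINFORCE++ objective may differ in its details.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_role_advantages(samples, eps=1e-8):
    """Normalize rewards within each role (hypothetical helper).

    samples: list of (role, reward) pairs.
    Returns a list of advantages in the same order as the input.
    """
    # Group rewards by the role that produced them.
    rewards_by_role = defaultdict(list)
    for role, reward in samples:
        rewards_by_role[role].append(reward)

    # Per-role baseline (mean) and scale (population standard deviation).
    stats = {
        role: (mean(rewards), pstdev(rewards))
        for role, rewards in rewards_by_role.items()
    }

    # Advantage = (reward - role mean) / (role std + eps).
    return [
        (reward - stats[role][0]) / (stats[role][1] + eps)
        for role, reward in samples
    ]


# Illustrative rewards only; these are not results from the paper.
batch = [
    ("challenger", 0.2),
    ("challenger", 0.6),
    ("solver", 1.0),
    ("solver", 0.0),
    ("solver", 1.0),
]
advantages = per_role_advantages(batch)
```

With this grouping, a Challenger reward is compared only against other Challenger rewards, so a role whose rewards are systematically lower than the others is not pushed toward uniformly negative advantages.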