Learning to Self-Evolve

Xiaoyin Chen, Canwen Xu, Yite Wang, Boyi Liu, Zhewei Yao, Yuxiong He

Abstract

We introduce Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models (LLMs) to improve their own contexts at test time. We situate LSE in the setting of test-time self-evolution, where a model iteratively refines its context from feedback on seen problems to perform better on new ones. Existing approaches rely entirely on the inherent reasoning ability of the model and never explicitly train it for this task. LSE reduces the multi-step evolution problem to a single-step RL objective, where each context edit is rewarded by the improvement in downstream performance. We pair this objective with a tree-guided evolution loop. On Text-to-SQL generation (BIRD) and general question answering (MMLU-Redux), a 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods including GEPA and TextGrad, and transfers to guide other models without additional training. Our results highlight the effectiveness of treating self-evolution as a learnable skill.

Paper Structure (31 sections, 10 equations, 4 figures, 4 tables, 1 algorithm)

Figures (4)

  • Figure 1: Overview of Learning to Self-Evolve (LSE). Left: Tree-guided self-evolution at test time. Upper Confidence Bound (UCB) selection chooses a context from the evolution tree; the action model generates outputs for a new batch of problems; the self-evolving policy receives the performance summary and proposes a revised context. Right: LSE trains the self-evolving policy via RL with an improvement-based reward computed as the difference between post-edit and pre-edit performance.
  • Figure 2: Ablation studies on reward design and search strategy. (a) $A_{\mathrm{GRPO}}$ uses $\bar{R}(c_1)$ with GRPO's group-based advantage; $A_{\mathrm{LSE}}$ uses the improvement-based reward $r_{\mathrm{LSE}} = \bar{R}(c_1) - \bar{R}(c_0)$ (Eq. \ref{eq:meta_reward}). (b) Tree search (UCB) vs. linear chain (always extends the most recent node), both with the untrained Qwen3-4B-Instruct as $f_\psi$.
  • Figure 3: Per-round average accuracy on the BIRD Card Games database. The linear chain cannot recover from bad edits, while tree search (UCB) backtracks to higher-scoring ancestors.
  • Figure 4: Search strategy ablation on MMLU-Redux, complementing Figure \ref{fig:ablation_search}. Both variants use the untrained Qwen3-4B-Instruct as the self-evolving policy $f_\psi$. Tree search improves the average accuracy from 69.0% to 71.2%.
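
The captions above mention two mechanisms that are easy to illustrate: the improvement-based reward $r_{\mathrm{LSE}} = \bar{R}(c_1) - \bar{R}(c_0)$ and the choice between UCB tree search and a linear chain. The following is a minimal sketch, not the authors' implementation. The `Node` class, the function names, the exploration constant `c`, and the UCB1-style bonus are all assumptions made for illustration; this excerpt does not give the paper's exact UCB formula.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass(eq=False)
class Node:
    """One context in the evolution tree (hypothetical structure)."""
    context: str
    parent: Optional["Node"] = field(default=None, repr=False)
    visits: int = 0            # number of batches evaluated with this context
    total_score: float = 0.0   # sum of per-batch mean scores

    @property
    def mean_score(self) -> float:
        return self.total_score / self.visits if self.visits else 0.0


def improvement_reward(post_edit: Sequence[float], pre_edit: Sequence[float]) -> float:
    """r_LSE = mean R(c_1) - mean R(c_0): reward an edit by how much it helps."""
    return sum(post_edit) / len(post_edit) - sum(pre_edit) / len(pre_edit)


def ucb_select(nodes: List[Node], c: float = 1.0) -> Node:
    """Pick the node with the highest UCB1-style score (assumed form).

    Unvisited nodes are tried first. Because every node in the tree is a
    candidate, selection can return to an earlier, higher-scoring ancestor.
    """
    total_visits = sum(n.visits for n in nodes)

    def score(n: Node) -> float:
        if n.visits == 0:
            return math.inf
        return n.mean_score + c * math.sqrt(math.log(total_visits) / n.visits)

    return max(nodes, key=score)


def linear_chain_select(nodes: List[Node]) -> Node:
    """Baseline from the ablation: always extend the most recent node."""
    return nodes[-1]


if __name__ == "__main__":
    root = Node("base prompt", visits=10, total_score=7.0)             # mean 0.7
    bad_edit = Node("worse prompt", parent=root, visits=10, total_score=4.0)  # mean 0.4
    tree = [root, bad_edit]
    print(ucb_select(tree).context)           # tree search falls back to the root
    print(linear_chain_select(tree).context)  # the chain stays on the bad edit
    print(improvement_reward([1, 1, 0, 1], [1, 0, 0, 1]))
```

With equal visit counts the exploration bonus is identical for both nodes, so UCB selection returns the higher-scoring root while the linear chain keeps extending the degraded context, which mirrors the behavior described for Figure 3.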