Learning to Self-Evolve

Xiaoyin Chen; Canwen Xu; Yite Wang; Boyi Liu; Zhewei Yao; Yuxiong He

Abstract

We introduce Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models (LLMs) to improve their own contexts at test time. We situate LSE in the setting of test-time self-evolution, where a model iteratively refines its context from feedback on seen problems to perform better on new ones. Existing approaches rely entirely on the inherent reasoning ability of the model and never explicitly train it for this task. LSE reduces the multi-step evolution problem to a single-step RL objective, where each context edit is rewarded by the improvement in downstream performance. We pair this objective with a tree-guided evolution loop. On Text-to-SQL generation (BIRD) and general question answering (MMLU-Redux), a 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods including GEPA and TextGrad, and transfers to guide other models without additional training. Our results highlight the effectiveness of treating self-evolution as a learnable skill.

Paper Structure (31 sections, 10 equations, 4 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Training-time self-evolution.
Test-time self-evolution.
Method
Test-Time Inter-Episode Evolution
Prompt-Based Evolution with Tree Search
Tree-guided evolution.
Learning to Self-Evolve (LSE)
Experiments
Experimental Setup
Models.
Tasks and datasets.
Baselines.
Evaluation protocol.
...and 16 more sections

Figures (4)

Figure 1: Overview of Learning to Self-Evolve (LSE). Left: Tree-guided self-evolution at test time. Upper Confidence Bound (UCB) selection chooses a context from the evolution tree; the action model generates outputs for a new batch of problems; the self-evolving policy receives the performance summary and proposes a revised context. Right: LSE trains the self-evolving policy via RL with an improvement-based reward computed as the difference between post-edit and pre-edit performance.
Figure 2: Ablation studies on reward design and search strategy. (a) $A_{\mathrm{GRPO}}$ uses $\bar{R}(c_1)$ with GRPO's group-based advantage; $A_{\mathrm{LSE}}$ uses the improvement-based reward $r_{\mathrm{LSE}} = \bar{R}(c_1) - \bar{R}(c_0)$ (Eq. \ref{eq:meta_reward}). (b) Tree search (UCB) vs. linear chain (always extends the most recent node), both with the untrained Qwen3-4B-Instruct as $f_\psi$.
Figure 3: Per-round average accuracy on the BIRD Card Games database. The linear chain cannot recover from bad edits, while tree search (UCB) backtracks to higher-scoring ancestors.
Figure 4: Search strategy ablation on MMLU-Redux, complementing Figure \ref{fig:ablation_search}. Both variants use the untrained Qwen3-4B-Instruct as the self-evolving policy $f_\psi$. Tree search improves the average accuracy from 69.0% to 71.2%.
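
The captions above name two mechanisms: UCB selection over an evolution tree of contexts, and the improvement-based reward $r_{\mathrm{LSE}} = \bar{R}(c_1) - \bar{R}(c_0)$. The sketch below illustrates both under stated assumptions. The `Node` structure, the function names, the exploration constant `c`, and the use of the standard UCB1 score are all hypothetical choices made for illustration; the captions do not specify the exact UCB variant or data structures the paper uses.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One context in the evolution tree (hypothetical structure)."""
    context: str
    parent: Optional["Node"] = None
    # Batch-level performance values observed when acting with this context.
    scores: List[float] = field(default_factory=list)

    @property
    def visits(self) -> int:
        return len(self.scores)

    @property
    def mean_score(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0


def ucb_select(nodes: List[Node], c: float = 1.0) -> Node:
    """Pick the node with the highest UCB1 score (assumed variant).

    Unvisited nodes get an infinite score so each is tried at least once.
    """
    total_visits = sum(n.visits for n in nodes)

    def ucb(n: Node) -> float:
        if n.visits == 0:
            return math.inf
        bonus = c * math.sqrt(math.log(total_visits) / n.visits)
        return n.mean_score + bonus

    return max(nodes, key=ucb)


def improvement_reward(pre_edit: float, post_edit: float) -> float:
    """r_LSE = R_bar(c_1) - R_bar(c_0): post-edit minus pre-edit performance."""
    return post_edit - pre_edit


# Toy usage: a well-explored root and a lightly explored child.
root = Node("initial context", scores=[0.5, 0.5, 0.5, 0.5])
child = Node("revised context", parent=root, scores=[0.4])
chosen = ucb_select([root, child], c=1.0)
```

With `c=1.0` the exploration bonus favors the lightly visited child even though its mean is lower; with `c=0.0` selection is purely greedy and returns the root. Because every node in the tree stays a candidate, a bad edit does not trap the search, which matches the backtracking behavior described in the Figure 3 caption.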
