SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization

Jiarui Yuan; Tailin Jin; Weize Chen; Zeyuan Liu; Zhiyuan Liu; Maosong Sun

TL;DR

SE-Bench is an obfuscated-API benchmark that isolates knowledge internalization as the core of self-evolution. It maps NumPy to a pseudo-novel package, zwc, with randomized identifiers and enforces strict rules on when documentation is available, creating a needle-in-a-haystack setting where success depends on true internalization rather than prompt conditioning or external documentation. The study reports three central findings: (1) internalization requires Closed-Book Training to compress knowledge into weights; (2) standard RL struggles to internalize new knowledge, which the authors attribute to PPO clipping and negative gradients; and (3) self-play can enable knowledge internalization when combined with SFT, but not with RL. SE-Bench thus provides a rigorous diagnostic platform for understanding and advancing self-evolving agents, and motivates targeted hybrid strategies for robust internalization.

Abstract

True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where "new" knowledge may appear in pre-training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE-Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo-novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API doc but impossible for base models without it. Our investigation reveals three insights: (1) the Open-Book Paradox, where training with reference documentation inhibits retention, requiring "Closed-Book Training" to force knowledge compression into weights; (2) the RL Gap, where standard RL fails to internalize new knowledge completely due to PPO clipping and negative gradients; and (3) the viability of Self-Play for internalization, proving models can learn from self-generated, noisy tasks when coupled with SFT, but not RL. Overall, SE-Bench establishes a rigorous diagnostic platform for self-evolution with knowledge internalization. Our code and dataset can be found at https://github.com/thunlp/SE-Bench.
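
The obfuscation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (`random_identifier`, `obfuscate_module`, `translate_doc`), the 8-letter alias scheme, and the seed are all assumptions made for illustration, and the standard-library `math` module stands in for NumPy so the example runs without extra dependencies.

```python
import math
import random
import re
import string
import types


def random_identifier(rng, length=8):
    """Return a random lowercase identifier (hypothetical naming scheme)."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def obfuscate_module(source, names, package_name="zwc", seed=0):
    """Expose `names` from `source` under random aliases in a new module.

    Returns the wrapper module and the original-name -> alias mapping.
    """
    rng = random.Random(seed)
    wrapper = types.ModuleType(package_name)
    mapping = {}
    used = set()
    for name in names:
        alias = random_identifier(rng)
        while alias in used:  # avoid alias collisions
            alias = random_identifier(rng)
        used.add(alias)
        mapping[name] = alias
        setattr(wrapper, alias, getattr(source, name))
    return wrapper, mapping


def translate_doc(doc, mapping):
    """Rewrite whole-word occurrences of original names to their aliases."""
    if not mapping:
        return doc
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], doc)


# `math` is a stand-in here; the benchmark itself wraps NumPy.
zwc, mapping = obfuscate_module(math, ["sqrt", "floor"], seed=0)
hidden_sqrt = getattr(zwc, mapping["sqrt"])
print(hidden_sqrt(16.0))
print(translate_doc("sqrt(x) returns the square root of x.", mapping))
```

Because the wrapper only re-exports existing functions, behavior is unchanged while every identifier a model would need to recall is new, which is the property the benchmark relies on.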

Paper Structure (25 sections, 3 equations, 5 figures, 18 tables)

Introduction
SE-Bench
Benchmark Construction
Dataset Splits & Protocol
Metrics
Statistics and Validation
Experiment
Analysis and Insight
RQ1: Does SFT Induce True Internalization, or Merely Context Dependence?
RQ2: Can RL Internalize Knowledge in the Context?
RQ3: Can Self-Play Enable Knowledge Internalization?
RQ4: How Knowledge Evolves from SFT to RL?
Discussion: Connections to Recent Advancements
Related Work
Conclusion
...and 10 more sections

Figures (5)

Figure 1: Overview of the SE-Bench construction pipeline. The process consists of three main stages: (1) Obfuscation, where we implement a wrapper package zwc that renames selected NumPy functions and translates API documentation; (2) Generation, where Claude-4.5-sonnet generates valid tasks and test cases based on the original NumPy library; and (3) Filtering, where tasks are validated through strict consensus between three strong LLMs, followed by human verification.
Figure 2: Closed-SFT vs. Open-SFT. Open-SFT's complete failure without documentation reveals strict context dependency, whereas Closed-SFT successfully internalizes knowledge, maintaining performance even when documentation is absent.
Figure 3: Error type distribution for Closed-SFT and Closed-SFT-RL.
Figure 4: Effects of diversity on knowledge internalization. The left panel shows the impact of question diversity, while the right panel shows the impact of response diversity. Question diversity has a substantially larger influence on knowledge internalization.
Figure 5: Performance of Closed-SFT$_\text{self}$ vs. Open-SFT$_\text{self}$ on the test set, with and without the relevant API documentation.
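
The Closed-SFT vs. Open-SFT contrast in Figures 2 and 5 comes down to whether the API documentation appears in the training input. The sketch below shows one way that difference could be expressed when building supervised examples. The prompt template, the function name `build_training_example`, and the sample alias `qmvtkzpa` are hypothetical; the paper's actual templates are not given in this summary.

```python
def build_training_example(question, response, api_doc, closed_book=True):
    """Return a (prompt, target) pair for supervised fine-tuning.

    closed_book=True omits the documentation from the prompt, so the model
    must produce the obfuscated API calls from its weights alone.
    closed_book=False keeps the documentation in context (open-book).
    """
    if closed_book:
        prompt = f"Task:\n{question}\n"
    else:
        prompt = f"Documentation:\n{api_doc}\n\nTask:\n{question}\n"
    return prompt, response


# Hypothetical documentation entry and task, for illustration only.
api_doc = "zwc.qmvtkzpa(x): return the element-wise square root of x."
question = "Compute the element-wise square root of the array `a` using zwc."
response = "import zwc\nresult = zwc.qmvtkzpa(a)"

closed_prompt, closed_target = build_training_example(
    question, response, api_doc, closed_book=True
)
open_prompt, open_target = build_training_example(
    question, response, api_doc, closed_book=False
)
```

In both modes the target is identical; only the conditioning context differs, which is what lets the two settings be compared directly at test time with and without documentation.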
