Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain

Wei Liu, Siya Qi, Yali Du, Yulan He

TL;DR

Sustainable self-evolution requires a self-synthesised data pipeline whose learnable information increases across iterations. We identify the triadic roles that self-evolving LLMs play (Proposer, Solver, and Verifier) and provide a measurable, system-level path from brittle self-play dynamics to sustained self-evolution.

Abstract

Large language models (LLMs) make it plausible to build systems that improve through self-evolving loops, but many existing proposals are better understood as self-play and often plateau quickly. A central failure mode is that the loop synthesises more data without increasing the learnable information available to the next iteration. Through experiments on a self-play coding task, we show that sustainable self-evolution requires a self-synthesised data pipeline whose learnable information increases across iterations. We identify the triadic roles that self-evolving LLMs play: the Proposer, which generates tasks; the Solver, which attempts solutions; and the Verifier, which provides training signals. From this triadic-role perspective, we identify three system designs that jointly target learnable information gain. Asymmetric co-evolution closes a weak-to-strong-to-weak loop across roles. Capacity growth expands parameter and inference-time budgets to match rising learnable information. Proactive information seeking introduces external context and new task sources that prevent saturation. Together, these designs provide a measurable, system-level path from brittle self-play dynamics to sustained self-evolution.
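
To make the plateau argument concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a hypothetical addition task whose difficulty is the operand digit count, a Solver modelled as a single integer skill level, and a crude "learnable" proxy (failed tasks exactly one level above the Solver's skill). All names (`Task`, `propose`, `solve`, `verify`, `run_loop`) are illustrative, and the proxy is not the epiplexity measure used in the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    a: int
    b: int

    @property
    def difficulty(self) -> int:
        # Hypothetical difficulty proxy: digit count of the larger operand.
        return len(str(max(self.a, self.b)))


def propose(level: int, n: int = 4) -> List[Task]:
    """Proposer: emit n addition tasks whose operands have `level` digits."""
    base = 10 ** (level - 1)
    return [Task(base + i, base + 2 * i + 1) for i in range(n)]


def solve(task: Task, skill: int) -> Optional[int]:
    """Solver: answers correctly only up to its current skill level."""
    return task.a + task.b if task.difficulty <= skill else None


def verify(task: Task, answer: Optional[int]) -> bool:
    """Verifier: turns an attempted answer into a binary training signal."""
    return answer == task.a + task.b


def run_loop(iterations: int, grow_proposer: bool) -> List[int]:
    """Run the toy loop and return the Solver's skill after each iteration."""
    skill, level = 1, 2
    history = []
    for _ in range(iterations):
        tasks = propose(level)
        rewards = [verify(t, solve(t, skill)) for t in tasks]
        # Assumed proxy for learnable information: failed tasks that sit
        # exactly one level above what the Solver can already do.
        learnable = sum(
            1 for t, ok in zip(tasks, rewards)
            if not ok and t.difficulty == skill + 1
        )
        if learnable > 0:
            skill += 1  # the Solver learns from the informative failures
        if grow_proposer:
            level = skill + 1  # the Proposer keeps pace with the Solver
        history.append(skill)
    return history


static = run_loop(5, grow_proposer=False)   # Proposer never changes
coevolve = run_loop(5, grow_proposer=True)  # Proposer tracks the Solver
print(static, coevolve)
```

With a fixed Proposer, the Solver's skill stops rising after the first iteration because every later batch is already solved and carries no learnable signal under this proxy. When the Proposer is updated to stay one step ahead, each iteration continues to yield new learnable tasks.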

Paper Structure

This paper contains 18 sections, 5 equations, 12 figures, and 1 algorithm.

Figures (12)

  • Figure 1: A self-evolving LLM plays three roles: Proposer, Solver, and Verifier. The whole self-evolving process can be seen as different synthetic operations (synthesising QA pairs, solutions, and feedback) on the same information source, which is the LLM itself.
  • Figure 2: Overall framework of a triadic self-evolving loop. A self-evolving LLM plays three roles: the Proposer and Verifier form the internal environment, proactively interacting with the external environment to provide data and supervision for the Solver. The Solver and internal environment co-evolve asymmetrically, adaptively expanding capacity to capture more learnable information. From an information perspective, the system continually absorbs external information and transforms it into internal learnable information.
  • Figure 3: Illustration of three designs from the perspective of learnable information. Asymmetry between the Solver and the Proposer/Verifier creates learning opportunities. Expanding model capacity to match self-evolving data opens space for learnable information. Reusing the same patterns in new contexts yields limited gains, whereas introducing new synthetic directions creates fresh asymmetries and thus new sources of learnable information.
  • Figure 4: Climbing the intelligence asymmetry ladder by closing the loop among Proposer, Solver, and Verifier. "Intelligence synchronisation" denotes updating the weaker Proposer/Verifier with the stronger Solver. "Reinforcement learning" uses the weaker Proposer/Verifier to train the Solver.
  • Figure 5: Epiplexity results on synthetic data with different tasks (induction, abduction, and deduction) proposed by different Proposer LLMs and observed by different Solver LLMs. See Appendix \ref{appendix:detail_epiplexity} for details of the epiplexity calculation.
  • ...and 7 more figures