The Path of Self-Evolving Large Language Models: Achieving Data-Efficient Learning via Intrinsic Feedback

Hangfan Zhang, Siyuan Xu, Zhimeng Guo, Huaisheng Zhu, Shicheng Liu, Xinrun Wang, Qiaosheng Zhang, Yang Chen, Peng Ye, Lei Bai, Shuyue Hu

TL;DR

This work tackles data efficiency in RL-based training of LLM reasoning by introducing self-aware RL, a framework in which the model generates its own tasks and assesses its own capability. Two mechanisms, self-aware difficulty prediction and self-aware limit breaking, guide task selection and decide when to request external guidance, forming a self-evolving loop between a generator and a solver. The approach uses a verifiable Python execution environment to produce reliable feedback signals and trains the agents with REINFORCE++. Across nine benchmarks in mathematical reasoning and code generation, the method achieves a 53.8% relative improvement with less than 1.2% extra data, underscoring the potential of intrinsic feedback for scalable LLM learning.

Abstract

Reinforcement learning (RL) has demonstrated potential in enhancing the reasoning capabilities of large language models (LLMs), but such training typically demands substantial effort to create and annotate data. In this work, we explore improving LLMs through RL with minimal data. Our approach alternates between the LLM proposing a task and then attempting to solve it. To minimize data dependency, we introduce two novel mechanisms grounded in self-awareness: (1) self-aware difficulty prediction, where the model learns to assess task difficulty relative to its own abilities and to prioritize challenging yet solvable tasks, and (2) self-aware limit breaking, where the model recognizes when a task is beyond its capability boundary and proactively requests external data to break through that limit. Extensive experiments on nine benchmarks show a 53.8% relative improvement with less than 1.2% extra data, demonstrating the efficacy of self-aware RL and underscoring the promise of self-evolving agent training.

Paper Structure

This paper contains 20 sections, 10 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of self-aware RL. The generator agent first generates a task together with a predicted success rate $\mu(x)$ (Section \ref{sec:dp}). The solver agent then generates reasoning paths. If none of these reasoning paths is correct, so that no update is available, a task filter decides whether the task is valuable enough to be passed to an external solver (Section \ref{sec:lb}). Finally, the collected difficulty prediction reward, outcome reward, and format reward are aggregated to compute the policy update (Section \ref{sec:sarl}).
  • Figure 2: The training reward of self-aware RL increases steadily as training continues. The reward is lower during the first few steps because the dialogue template is more complicated than the baseline's. Once the agent has adapted to the new dialogue template, the training reward of self-aware RL rises quickly and surpasses the baseline reward.
  • Figure 3: Accuracy of difficulty prediction. The generator agent driven by the pre-trained base model performs poorly at difficulty prediction (Section \ref{sec:dp}), as shown by the low accuracy at the first step. After 50 steps of tuning, the generator agent performs much better. Note that the accuracy shown in this figure is measured by the difficulty prediction reward in Equation \ref{eq:dp}.
  • Figure 4: Accuracy of rollouts generated by the solver agent. The accuracy is initially high, reflecting that the untrained generator does not produce challenging tasks. As training continues, the difficulty of the generated tasks increases and the rollout accuracy gradually decreases, finally stabilizing around 0.6.
  • Figure 5: Utility score $z(x)$ of selected and unselected tasks. Selected tasks should have high utility and are passed to the external solver, while unselected tasks have lower utility and are discarded.
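
The flow described in the Figure 1 caption can be sketched for a single task as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `self_aware_step`, `difficulty_prediction_reward`, `utility_threshold`, and `weights` are hypothetical, the difficulty prediction reward is assumed to be one minus the absolute gap between $\mu(x)$ and the observed success rate (the paper's own equation may differ), and the three rewards are assumed to be combined by a weighted sum.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StepResult:
    empirical_success: float   # fraction of correct solver rollouts
    difficulty_reward: float   # how well mu(x) matched the observed rate
    sent_to_external: bool     # whether the task filter escalated the task
    total_reward: float        # aggregated scalar reward


def difficulty_prediction_reward(predicted: float, empirical: float) -> float:
    # ASSUMPTION: reward is 1 - |mu(x) - observed success rate|.
    # The paper defines its own reward; this form is illustrative only.
    return 1.0 - abs(predicted - empirical)


def self_aware_step(
    predicted_success: float,
    rollout_correct: List[bool],
    utility: float,
    utility_threshold: float = 0.5,                    # hypothetical cut-off on z(x)
    format_ok: bool = True,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # hypothetical weights
) -> StepResult:
    if not rollout_correct:
        raise ValueError("at least one solver rollout is required")

    # Outcome reward: share of rollouts verified as correct.
    empirical = sum(rollout_correct) / len(rollout_correct)
    dp_reward = difficulty_prediction_reward(predicted_success, empirical)

    # Limit breaking: only when no rollout is correct does the task filter
    # decide, from the utility score, whether to call an external solver.
    escalate = (not any(rollout_correct)) and utility >= utility_threshold

    w_dp, w_out, w_fmt = weights
    total = w_dp * dp_reward + w_out * empirical + w_fmt * (1.0 if format_ok else 0.0)
    return StepResult(empirical, dp_reward, escalate, total)
```

For example, a task with predicted success rate 0.5 whose rollouts all fail is escalated only if its utility score clears the threshold; a task with at least one correct rollout is never escalated, because a policy update is already available.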