Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning

Yulei Qin, Xiaoyu Tan, Zhengbao He, Gang Li, Haojia Lin, Zongyi Li, Zihan Xu, Yuchen Shi, Siqi Cai, Renting Rui, Shaofei Cai, Yuzheng Cai, Xuan Zhang, Sheng Ye, Ke Li, Xing Sun

TL;DR

This work addresses the exploration–exploitation trade-off in RL for agentic LLMs by introducing SPEAR, a curriculum-guided self-imitation learning framework that progressively shifts policy entropy alongside intrinsic rewards. SPEAR extends vanilla self-imitation learning with advantage recalibration, prioritized replay, and curriculum schedules to harness past successful experiences while keeping entropy stable. The approach is paired with a strong Dr.BoT baseline and validated on ALFWorld, WebShop, and AIME-style tasks, yielding substantial gains with modest computational overhead. The results demonstrate SPEAR's plug-and-play scalability and its potential to strengthen tool use and reasoning in open-world, multi-turn LLM agents.

Abstract

Reinforcement learning (RL) is the dominant paradigm for sharpening the strategic tool-use capabilities of LLMs on long-horizon, sparsely rewarded agent tasks, yet it faces the fundamental challenge of the exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL instability due to multi-turn distribution shift. In this paper, we target a progressive exploration-exploitation balance guided by the agent's own experiences, without succumbing to either entropy collapse or runaway divergence. We propose SPEAR, a self-imitation learning (SIL) recipe for training agentic LLMs. It extends vanilla SIL, where a replay buffer stores good experiences for off-policy updates, by gradually steering the policy entropy across stages. Specifically, the proposed curriculum scheduling harmonizes intrinsic reward shaping and self-imitation to 1) expedite exploration via frequent tool interactions at the beginning, and 2) strengthen exploitation of successful tactics once the agent has become familiar with the environment. We also combine a bag of tricks from industrial RL optimizations into a strong baseline, Dr.BoT, to demonstrate our effectiveness. In ALFWorld and WebShop, SPEAR increases the success rates of GRPO/GiGPO/Dr.BoT by up to 16.1%/5.1%/8.6% and 20.7%/11.8%/13.9%, respectively. In AIME24 and AIME25, SPEAR boosts Dr.BoT by up to 3.8% and 6.1%, respectively. Such gains incur only 10%-25% extra theoretical complexity and negligible runtime overhead in practice, demonstrating the plug-and-play scalability of SPEAR.
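The two curriculum directions named in the abstract (more self-imitation later, less intrinsic reward later) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the linear ramp shape, the names `self_imitation_weight`, `intrinsic_reward_weight`, `Trajectory`, and `ReplayBuffer`, and the choice to treat "good experience" as a trajectory with positive advantage.

```python
from collections import deque
from dataclasses import dataclass


def ramp(step: int, total_steps: int) -> float:
    """Training progress clamped to [0, 1]. The linear shape is an assumption."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def self_imitation_weight(step: int, warmup_steps: int, max_weight: float = 1.0) -> float:
    """Weight on the off-policy replay loss: starts at 0 and grows, so early
    training is dominated by on-policy exploration."""
    return max_weight * ramp(step, warmup_steps)


def intrinsic_reward_weight(step: int, decay_steps: int, init_weight: float = 1.0) -> float:
    """Weight on the intrinsic (e.g. tool-call) reward: starts high and decays
    to 0, so later training focuses on the outcome reward."""
    return init_weight * (1.0 - ramp(step, decay_steps))


@dataclass
class Trajectory:
    tokens: list
    advantage: float


class ReplayBuffer:
    """Bounded FIFO buffer that keeps only positive-advantage trajectories
    (hypothetical stand-in for 'good experience')."""

    def __init__(self, capacity: int):
        self._items = deque(maxlen=capacity)  # oldest entries are evicted first

    def add(self, traj: Trajectory) -> bool:
        if traj.advantage > 0:
            self._items.append(traj)
            return True
        return False

    def __len__(self) -> int:
        return len(self._items)
```

Under these assumptions, a training loop would scale the replay loss by `self_imitation_weight(step, ...)` and the shaped reward by `intrinsic_reward_weight(step, ...)` at every step, so the two signals trade places over the course of training.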

Paper Structure

This paper contains 108 sections, 28 equations, 14 figures, 10 tables, 1 algorithm.

Figures (14)

  • Figure 1: Our SPEAR harmonizes the curriculum-scheduled self-imitation learning with intrinsic reward shaping for progressive exploration, improving policy performance across agentic tasks.
  • Figure 2: Overview of SPEAR. First, the agent interacts with the environment to collect a set of trajectories, which flow through intrinsic reward shaping and advantage estimation for on-policy updates. Second, selected trajectories are stored in a replay buffer, enabling off-policy updates via the proposed self-imitation scheme. This dual integration makes maximal use of past experiences, thereby expanding the effective exploration space while mitigating persistent uncertainty.
  • Figure 3: Effect of our self-imitation on action-level strategy exploration (Qwen2.5-32B with code interpreter). The vanilla experience replay technique (Oh et al., 2018), which forces early overfitting to the few trajectories available in the buffer, causes entropy collapse and shrinks exploration. At the beginning, the LLM agent struggles with tool calling and fails to shift its distribution towards frequent tool use and tool-integrated reasoning. Naive replay thus limits the transformation of the reasoning paradigm. In contrast, our SPEAR introduces both curriculum-based and covariance-based regularization into self-imitation. Its curriculum schedule, which places increasing emphasis on the replay data, allows easy acquisition of tool-use skills at first and stimulates strategic action plans later. The covariance clipping excludes over-confident tokens, whose log probabilities are highly associated with their advantage gains, from optimization. Our self-imitation shows promise in exploring novel strategies and achieves steady growth on AIME 2025.
  • Figure 4: Effect of our intrinsic reward on skill-level strategy exploration (Qwen2.5-32B with code interpreter). The baseline does not treat tool calling as a rewarded behavior, and its number of interactions with the environment drops quickly due to the negative feedback from bad code. In this case, the LLM gives up coding and degrades to text-based reasoning. The vanilla tool-call reward, despite being effective for learning tool-call skills at first, later competes with the outcome reward. Because the context length is limited, excessive tool-call turns prevent submission of the final answer, and accuracy then declines immediately. We propose a curriculum schedule as the intrinsic reward design, in which the reward's strength decays over training steps so that the agent can focus solely on accuracy with wiser actions. It prevents reward hacking through unnecessarily long interactions.
  • Figure 5: The agent learns to push the box.
  • ...and 9 more figures
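
The covariance clipping mentioned in the Figure 3 caption can be pictured with a small sketch. This is not the paper's implementation: the function name `covariance_mask`, the fixed `threshold`, and the rule of dropping every token whose centered log-probability/advantage product exceeds that threshold are all assumptions chosen to keep the example short.

```python
from statistics import fmean


def covariance_mask(log_probs, advantages, threshold):
    """Return one boolean per token: True keeps the token in the loss,
    False excludes it.

    A token is excluded when its contribution to the covariance between
    log-probability and advantage, (lp - mean_lp) * (adv - mean_adv),
    is larger than `threshold` (a hypothetical hyperparameter).
    """
    if len(log_probs) != len(advantages):
        raise ValueError("log_probs and advantages must have the same length")
    if not log_probs:
        return []
    mean_lp = fmean(log_probs)
    mean_adv = fmean(advantages)
    mask = []
    for lp, adv in zip(log_probs, advantages):
        cov_term = (lp - mean_lp) * (adv - mean_adv)
        mask.append(cov_term <= threshold)
    return mask


# Toy batch of four tokens (made-up numbers, for illustration only).
toy_mask = covariance_mask(
    log_probs=[-0.1, -2.0, -1.0, -3.0],
    advantages=[2.0, 0.0, 0.5, -1.0],
    threshold=1.0,
)
```

In a training loop, such a mask would be multiplied into the per-token policy loss so that the excluded tokens contribute no gradient.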