Demystifying Reinforcement Learning in Agentic Reasoning

Zhaochen Yu, Ling Yang, Jiaru Zou, Shuicheng Yan, Mengdi Wang

TL;DR

The paper tackles the challenge of scaling agentic reinforcement learning for LLMs, focusing on three facets: data, algorithm, and reasoning mode. It demonstrates that real end-to-end trajectories, diverse and model-aware data, and simple RL recipes (clip higher, overlong reward shaping, and token-level loss) yield significant gains in agentic reasoning and training efficiency. A core finding is that maintaining moderate policy entropy and adopting a deliberate reasoning-before-tool-use approach improve tool efficiency and final accuracy, while Long-CoT priors can hinder agentic RL. The authors release a 3k real SFT dataset, a 30k RL dataset, and a strong 4B baseline model (DemyAgent-4B) that achieves state-of-the-art agentic performance on challenging benchmarks, establishing practical baselines and guiding future agentic RL research.

Abstract

The recent emergence of agentic RL has shown that RL can also effectively improve the agentic reasoning ability of LLMs, yet the key design principles and optimal practices remain unclear. In this work, we conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning from three key perspectives: data, algorithm, and reasoning mode. We highlight our key insights: (i) Replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT initialization; high-diversity, model-aware datasets sustain exploration and markedly improve RL performance. (ii) Exploration-friendly techniques are crucial for agentic RL: clip higher, overlong reward shaping, and maintaining adequate policy entropy can improve training efficiency. (iii) A deliberative strategy with fewer tool calls outperforms frequent tool calls or verbose self-reasoning, improving tool efficiency and final accuracy. Together, these simple practices consistently enhance agentic reasoning and training efficiency, achieving strong results on challenging benchmarks with smaller models and establishing a practical baseline for future agentic RL research. Beyond these empirical insights, we further contribute a high-quality, real end-to-end agentic SFT dataset along with a high-quality RL dataset, and demonstrate the effectiveness of our insights in boosting the agentic reasoning ability of LLMs across four challenging benchmarks: AIME2024, AIME2025, GPQA-Diamond, and LiveCodeBench-v6. With our recipes, 4B-sized models can also achieve superior agentic reasoning performance compared to 32B-sized models. Code and models: https://github.com/Gen-Verse/Open-AgentRL
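
The abstract names clip higher, overlong reward shaping, and (in the TL;DR) token-level loss without defining them. The sketch below illustrates one common reading of these terms, following the DAPO-style formulations: an asymmetric clipping range, a loss averaged over all tokens in the batch rather than per sequence, and a soft length penalty. This is an assumption about what the terms mean here, not the authors' implementation; all function names, the default values `eps_low=0.2` and `eps_high=0.28`, and the `buffer_len` parameter are illustrative.

```python
import math


def clipped_token_objective(logp_new, logp_old, advantage,
                            eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate for a single token.

    "Clip higher" (assumed meaning): the upper bound 1 + eps_high is looser
    than the lower bound 1 - eps_low, so low-probability tokens with positive
    advantage can grow more before being clipped.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def token_level_loss(batch, eps_low=0.2, eps_high=0.28):
    """Average the objective over every token in the batch.

    batch: list of sequences, each a list of
           (logp_new, logp_old, advantage) tuples.
    Long sequences contribute proportionally more than short ones, unlike a
    per-sequence mean followed by a mean over sequences.
    """
    total, count = 0.0, 0
    for seq in batch:
        for logp_new, logp_old, adv in seq:
            total += clipped_token_objective(logp_new, logp_old, adv,
                                             eps_low, eps_high)
            count += 1
    return -total / count if count else 0.0


def overlong_penalty(length, max_len, buffer_len):
    """Soft length penalty (assumed form of overlong reward shaping).

    No penalty up to max_len - buffer_len, a linear ramp down to -1 inside
    the buffer, and -1 beyond max_len.
    """
    if length <= max_len - buffer_len:
        return 0.0
    if length <= max_len:
        return (max_len - buffer_len - length) / buffer_len
    return -1.0
```

With a one-token sequence of advantage 1 and a three-token sequence of advantage 0 (all ratios equal to 1), `token_level_loss` averages over four tokens, whereas a per-sequence mean would weight the two sequences equally. In an agentic setting, tokens returned by tools would typically be masked out before this computation; the sketch omits that step.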

Paper Structure

This paper contains 33 sections, 8 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: An overview of our research on agentic RL.
  • Figure 2: Comparison between our higher-diversity dataset and the ReTool dataset, which contains only math problems. Left: average@32 accuracy on AIME2025 during training on the two datasets. Right: policy entropy during training.
  • Figure 3: Comparison of the impact of the full 30k dataset and our dataset tailored for Qwen2.5-RA-SFT on subsequent RL training. Left: average@32 performance on AIME2025. Right: average reward during training.
  • Figure 4: The overall performance of our constructed three recipes: GRPO-T, GRPO-TCR, and GRPO-SCR on AIME2024/AIME2025 benchmark.
  • Figure 5: The analysis for the policy entropy in agentic RL training.
  • ...and 5 more figures