Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning

Mingyue Cheng, Jie Ouyang, Shuo Yu, Ruiran Yan, Yucong Luo, Zirui Liu, Daoyu Wang, Qi Liu, Enhong Chen

TL;DR

This work addresses how to train LLM agents that interact with environments through tools, reformulating Reinforcement Learning (RL) around a Markov Decision Process (MDP) extended for multi-turn interaction. It introduces Agent-R1, a modular framework built on two core components, Tool and ToolEnv, and refines policy optimization with action masking and aligned advantages to enable precise credit assignment. Evaluation on multi-hop QA with external search shows substantial gains over baselines across multiple RL algorithms, with GRPO performing best on average and PPO performing best on out-of-domain data. The result is an end-to-end RL training paradigm for agentic LLMs that relies on structured reward signals.
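As a rough illustration of what action masking means for credit assignment, the sketch below averages a simple REINFORCE-style surrogate loss over agent-generated tokens only, so prompt and tool-response tokens contribute nothing to the update. This is a minimal sketch under stated assumptions, not the paper's objective: the function name `masked_policy_loss`, the toy numbers, and the unclipped surrogate are all hypothetical, and PPO or GRPO would use their own clipped objectives in its place.

```python
from typing import Sequence


def masked_policy_loss(
    logprobs: Sequence[float],
    advantages: Sequence[float],
    action_mask: Sequence[int],
) -> float:
    """Mean of -advantage * logprob over tokens where action_mask == 1.

    Hypothetical helper for illustration; positions with mask 0
    (prompt or tool feedback) are excluded from the loss entirely.
    """
    if not (len(logprobs) == len(advantages) == len(action_mask)):
        raise ValueError("all inputs must have the same length")
    n_action = sum(action_mask)
    if n_action == 0:
        return 0.0  # nothing the agent generated, so nothing to optimize
    total = sum(-a * lp * m for lp, a, m in zip(logprobs, advantages, action_mask))
    return total / n_action


# Toy trajectory: one prompt token, two agent tokens,
# one tool-response token, one more agent token.
logprobs = [-0.5, -1.0, -2.0, -0.1, -3.0]
advantages = [9.0, 1.0, 1.0, 9.0, 2.0]
action_mask = [0, 1, 1, 0, 1]

loss = masked_policy_loss(logprobs, advantages, action_mask)
print(loss)
```

Changing the advantage or log-probability at a masked position (index 0 or 3 here) leaves the loss unchanged, which is the property the mask is meant to provide.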

Abstract

Large Language Models (LLMs) are increasingly being explored for building Agents capable of active environmental interaction (e.g., via tool use) to solve complex problems. Reinforcement Learning (RL) is considered a key technology with significant potential for training such Agents; however, the effective application of RL to LLM Agents is still in its nascent stages and faces considerable challenges. Currently, this emerging field lacks in-depth exploration into RL approaches specifically tailored for the LLM Agent context, alongside a scarcity of flexible and easily extensible training frameworks designed for this purpose. To help advance this area, this paper first revisits and clarifies Reinforcement Learning methodologies for LLM Agents by systematically extending the Markov Decision Process (MDP) framework to comprehensively define the key components of an LLM Agent. Secondly, we introduce Agent-R1, a modular, flexible, and user-friendly training framework for RL-based LLM Agents, designed for straightforward adaptation across diverse task scenarios and interactive environments. We conducted experiments on Multihop QA benchmark tasks, providing initial validation for the effectiveness of our proposed methods and framework.

Paper Structure

This paper contains 23 sections, 6 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison of workflows, agentic workflows, and autonomous agents. Workflows rely on human-designed routing or planning, while agentic workflows (e.g., ReAct) introduce iterative reasoning–acting loops. Fully autonomous agents remove predefined workflows and interact with the environment proactively through an end-to-end action–feedback cycle.
  • Figure 2: Illustration of the Agent-R1 training trajectory. The agent performs multi-turn reasoning and tool-based actions during rollout, receives environment feedback, and appends tool responses to form the next state. This trajectory—containing thinking steps, actions, and feedback—serves as the basis for reinforcement learning updates in Agent-R1.
  • Figure 3: Flow diagrams of Single-Turn RL and Multi-Turn RL (Agent-R1) in the generation stage.
  • Figure 4: Flow diagrams of Single-Turn RL and Multi-Turn RL (Agent-R1) in the learning stage.
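
The multi-turn rollout described in the Figure 2 caption (the agent acts, the environment returns tool feedback, and the feedback is appended to form the next state) can be sketched as a small loop. This is a toy sketch and not Agent-R1's actual API: `Trajectory`, `toy_search`, `toy_policy`, `rollout`, and the tag strings are hypothetical names chosen for illustration, and whole string segments stand in for tokens.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    tokens: List[str] = field(default_factory=list)
    # 1 = generated by the agent, 0 = prompt or tool feedback
    action_mask: List[int] = field(default_factory=list)

    def extend(self, new_tokens: List[str], is_action: bool) -> None:
        self.tokens.extend(new_tokens)
        self.action_mask.extend([1 if is_action else 0] * len(new_tokens))


def toy_search(query: str) -> str:
    """Hypothetical tool: looks up a canned fact."""
    facts = {"capital of France": "Paris"}
    return facts.get(query, "no result")


def toy_policy(state: List[str]) -> List[str]:
    """Hypothetical stand-in for the LLM: search once, then answer."""
    if "<tool_response>" not in state:
        return ["<search>", "capital of France", "</search>"]
    result = state[state.index("<tool_response>") + 1]
    return ["<answer>", result, "</answer>"]


def rollout(
    prompt: List[str],
    policy: Callable[[List[str]], List[str]],
    tool: Callable[[str], str],
    max_turns: int = 4,
) -> Trajectory:
    traj = Trajectory()
    traj.extend(prompt, is_action=False)
    for _ in range(max_turns):
        action = policy(traj.tokens)          # agent acts on the current state
        traj.extend(action, is_action=True)
        if action[0] == "<answer>":           # final answer ends the episode
            break
        feedback = ["<tool_response>", tool(action[1]), "</tool_response>"]
        traj.extend(feedback, is_action=False)  # feedback becomes part of the next state
    return traj


traj = rollout(["What is the capital of France?"], toy_policy, toy_search)
print(traj.tokens)
print(traj.action_mask)
```

The resulting `action_mask` marks which positions the agent produced, so a learning step can restrict its loss to those positions while still conditioning on the tool feedback.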