AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning

Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, Wei He, Yiwen Ding, Guanyu Li, Zehui Chen, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang

TL;DR

AgentGym-RL is a modular, end-to-end reinforcement learning framework for training long-horizon, multi-turn LLM agents. It is complemented by ScalingInter-RL, a curriculum-like approach that gradually scales the interaction horizon to stabilize optimization and foster exploration. The framework supports diverse real-world scenarios and mainstream RL algorithms, and it is to be released as open source. Extensive experiments show performance competitive with proprietary models across web navigation, retrieval, games, embodied tasks, and scientific tasks. Key findings are that ScalingInter-RL yields consistent gains, that post-training and test-time compute can yield larger gains than simply increasing model size, and that environment structure strongly influences RL efficiency. The work offers practical insights and tooling for building scalable, real-world LLM agents, with plans to improve generalization, extend to longer-horizon tasks, and add multi-agent settings.

Abstract

Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. Like human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch -- without relying on supervised fine-tuning (SFT) -- across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a new framework to train LLM agents for multi-turn interactive decision-making through RL. The framework features a modular and decoupled architecture, ensuring high flexibility and extensibility. It encompasses a wide variety of real-world scenarios, and supports mainstream RL algorithms. Furthermore, we propose ScalingInter-RL, a training approach designed for exploration-exploitation balance and stable RL optimization. In early stages, it emphasizes exploitation by restricting the number of interactions, and gradually shifts towards exploration with larger horizons to encourage diverse problem-solving strategies. In this way, the agent develops more diverse behaviors and is less prone to collapse under long horizons. We perform extensive experiments to validate the stability and effectiveness of both the AgentGym-RL framework and the ScalingInter-RL approach. Our agents match or surpass commercial models on 27 tasks across diverse environments. We offer key insights and will open-source the complete AgentGym-RL framework -- including code and datasets -- to empower the research community in developing the next generation of intelligent agents.
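
The horizon scaling described above, where the number of interactions is restricted early and enlarged later, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the stepwise form of the schedule, and every numeric default (initial budget of 5 turns, increments of 5 turns every 100 training steps, cap of 20 turns) are hypothetical values chosen only for illustration.

```python
def max_turns_for_step(
    step: int,
    initial_turns: int = 5,      # hypothetical starting interaction budget
    turns_per_stage: int = 5,    # hypothetical budget increase per stage
    steps_per_stage: int = 100,  # hypothetical number of training steps per stage
    turn_cap: int = 20,          # hypothetical upper bound on the budget
) -> int:
    """Return the maximum number of agent-environment turns allowed at a training step.

    The budget starts small and grows by a fixed amount each stage, up to a cap.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    stage = step // steps_per_stage
    return min(initial_turns + stage * turns_per_stage, turn_cap)


def truncate_rollout(turns: list, step: int) -> list:
    """Keep only the turns that fit within the interaction budget for this step."""
    return turns[: max_turns_for_step(step)]


if __name__ == "__main__":
    # Print the budget at a few training steps to show the staged growth.
    for step in (0, 99, 100, 250, 1000):
        print(step, max_turns_for_step(step))
```

In a training loop, a budget like this would bound how many turns each rollout may take before the episode is cut off, so early updates see only short trajectories and later updates see progressively longer ones.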

Paper Structure

This paper contains 74 sections, 6 equations, 16 figures, 6 tables.

Figures (16)

  • Figure 1: Left: Performance of proprietary models, open-source models, and our RL models across different agentic tasks. Right: Performance w.r.t. model scale. Working in concert, our framework and method substantially enhance the capabilities of open-source 7B-scale models to a level that rivals or even surpasses top-tier proprietary large models.
  • Figure 2: Overview of the AgentGym-RL framework. It features a decoupled, flexible, and extensible architecture, comprising three primary modules—the environment, the agent, and the training module. It supports diverse scenarios, environments, and algorithms.
  • Figure 3: Pseudocode demonstrating the example usage of our proposed framework (provided APIs marked orange), alongside a simplified theoretical diagram illustrating the agent-environment interaction and training pipeline.
  • Figure 4: An overview of the visualized user interface of our framework.
  • Figure 5: Illustration of the ScalingInter-RL approach. It allows the agent to adapt in stages: initially, by limiting interaction turns to prioritize exploitation, master basic skills, and solve easy tasks; later, by gradually increasing interactions to explore, avoid shortcuts, refine behavior, and tackle harder problems. Ultimately, this process trains a stronger agent.
  • ...and 11 more figures