SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent
Shiyi Cao, Dacheng Li, Fangzhou Zhao, Shuo Yuan, Sumanth R. Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, Richard Liaw, Philipp Moritz, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
TL;DR
SkyRL-Agent is a modular, backend-agnostic framework for efficient multi-turn RL training of LLM-based agents. It is built around a tool-centric agent loop, a fine-grained asynchronous dispatcher, and a backend bridge for interoperability with existing RL training frameworks. Together, these support scalable scheduling of heterogeneous rollouts and robust transition-based data capture, improving throughput and sample efficiency. In the SWE case study, SA-SWE-32B, trained purely with RL from Qwen3-32B, achieves 39.4% Pass@1 on SWE-Bench Verified at more than 2× lower cost than prior models reaching similar performance. The gains come from a 1.55× speedup of the asynchronous pipeline dispatcher over naive asynchronous batching and from an AST-based, tool-guided training recipe. Although trained only on SWE tasks, the model generalizes to Terminal-Bench, WebArena, and BrowseComp-Plus. Additional agents (Deep Research, Memory, Computer Use) demonstrate the framework's versatility across backends and toolsets. Overall, the work targets two bottlenecks in agent training, rollout orchestration and tool integration, and offers a practical, cost-efficient path to capable multi-turn LLM agents.
Abstract
We introduce SkyRL-Agent, a framework for efficient, multi-turn, long-horizon agent training and evaluation. It provides efficient asynchronous dispatching, lightweight tool integration, and flexible backend interoperability, enabling seamless use with existing RL frameworks such as SkyRL-train, VeRL, and Tinker. Using SkyRL-Agent, we train SA-SWE-32B, a software engineering agent trained from Qwen3-32B (24.4% Pass@1) purely with reinforcement learning. We introduce two key components: an optimized asynchronous pipeline dispatcher that achieves a 1.55x speedup over naive asynchronous batching, and a tool-enhanced training recipe leveraging an AST-based search tool to facilitate code navigation, boost rollout Pass@K, and improve training efficiency. Together, these optimizations enable SA-SWE-32B to reach 39.4% Pass@1 on SWE-Bench Verified with more than 2x cost reduction compared to prior models reaching similar performance. Despite being trained solely on SWE tasks, SA-SWE-32B generalizes effectively to other agentic tasks, including Terminal-Bench, BrowseComp-Plus, and WebArena. We further demonstrate SkyRL-Agent's extensibility through case studies on deep research, computer use, and memory agents, each trained using a different training backend.
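To make the contrast between naive asynchronous batching and a pipelined dispatcher concrete, the following is a minimal sketch in Python's `asyncio`. It is not SkyRL-Agent's implementation or API: the abstract does not describe the dispatcher's internals, so the three stage names (`init`, `generate`, `reward`), the per-stage concurrency limits, and every function name here are assumptions chosen for illustration. The batched variant waits for a whole batch to finish before starting the next one, while the pipelined variant lets each rollout move to its next stage as soon as that stage has a free slot.

```python
import asyncio

# Hypothetical stages and concurrency limits (illustrative assumptions only).
STAGE_LIMITS = {"init": 2, "generate": 4, "reward": 2}


async def run_stage(name, task_id, sems, active, peak):
    """Run one simulated stage under that stage's concurrency limit."""
    async with sems[name]:
        active[name] += 1
        peak[name] = max(peak[name], active[name])
        await asyncio.sleep(0.001)  # stand-in for real work
        active[name] -= 1
    return f"{name}:{task_id}"


async def rollout(task_id, sems, active, peak):
    """One trajectory: pass through every stage in order."""
    return [await run_stage(n, task_id, sems, active, peak) for n in STAGE_LIMITS]


def _new_state():
    sems = {n: asyncio.Semaphore(k) for n, k in STAGE_LIMITS.items()}
    return sems, dict.fromkeys(STAGE_LIMITS, 0), dict.fromkeys(STAGE_LIMITS, 0)


async def dispatch_batched(task_ids, batch_size):
    """Naive async batching: each batch must fully finish before the next starts."""
    task_ids = list(task_ids)
    sems, active, peak = _new_state()
    traces = []
    for i in range(0, len(task_ids), batch_size):
        batch = task_ids[i:i + batch_size]
        traces += await asyncio.gather(*(rollout(t, sems, active, peak) for t in batch))
    return traces, peak


async def dispatch_pipelined(task_ids):
    """Pipelined dispatch: all rollouts are in flight; stage limits do the pacing."""
    sems, active, peak = _new_state()
    traces = await asyncio.gather(*(rollout(t, sems, active, peak) for t in task_ids))
    return list(traces), peak


if __name__ == "__main__":
    traces, peak = asyncio.run(dispatch_pipelined(range(8)))
    print(traces[0], peak)
```

In this sketch both dispatchers produce identical per-task traces. They differ only in scheduling: the pipelined version never idles a stage while a slow rollout in the same batch finishes, which is the kind of stall a pipelined design is meant to avoid. Any actual speedup depends on the real stage costs and resource limits.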
