SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent

Shiyi Cao, Dacheng Li, Fangzhou Zhao, Shuo Yuan, Sumanth R. Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, Richard Liaw, Philipp Moritz, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica

TL;DR

SkyRL-Agent is a modular, backend-agnostic framework for efficient multi-turn RL training of LLM-based agents. It is built around a tool-centric agent loop, a fine-grained asynchronous dispatcher, and a backend bridge for interoperability with existing RL training frameworks. The design supports heterogeneous rollout scheduling and transition-based data capture, improving throughput and sample efficiency. In the SWE case study, SA-SWE-32B is trained purely with RL from Qwen3-32B (24.4% Pass@1) and reaches 39.4% Pass@1 on SWE-Bench Verified at more than 2× lower cost than prior models with similar performance. Two components drive this result: the Async Pipeline dispatcher, which is 1.55× faster than naive asynchronous batching, and a tool-enhanced training recipe built on an AST-based search tool. Although trained only on SWE tasks, SA-SWE-32B generalizes to Terminal-Bench, WebArena, and BrowseComp-Plus. Additional agents (Deep Research, Memory, Computer Use), each trained on a different backend, demonstrate the framework's versatility across backends and toolsets.

Abstract

We introduce SkyRL-Agent, a framework for efficient, multi-turn, long-horizon agent training and evaluation. It provides efficient asynchronous dispatching, lightweight tool integration, and flexible backend interoperability, enabling seamless use with existing RL frameworks such as SkyRL-train, VeRL, and Tinker. Using SkyRL-Agent, we train SA-SWE-32B, a software engineering agent trained from Qwen3-32B (24.4% Pass@1) purely with reinforcement learning. We introduce two key components: an optimized asynchronous pipeline dispatcher that achieves a 1.55x speedup over naive asynchronous batching, and a tool-enhanced training recipe leveraging an AST-based search tool to facilitate code navigation, boost rollout Pass@K, and improve training efficiency. Together, these optimizations enable SA-SWE-32B to reach 39.4% Pass@1 on SWE-Bench Verified with more than 2x cost reduction compared to prior models reaching similar performance. Despite being trained solely on SWE tasks, SA-SWE-32B generalizes effectively to other agentic tasks, including Terminal-Bench, BrowseComp-Plus, and WebArena. We further demonstrate SkyRL-Agent's extensibility through case studies on deep research, computer use, and memory agents, each trained using a different training backend.

Paper Structure

This paper contains 27 sections, 1 equation, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Training dynamics and system performance for SA-SWE-32B. (a) compares training metrics across ablations; (b) shows GPU utilization comparison between asynchronous dispatching strategies.
  • Figure 2: Architecture overview of SkyRL-Agent. The framework decomposes each rollout into three stages: (1) runtime initialization, which sets up the tool execution runtime, (2) agent run, where the agent performs actions through the tool interface, and (3) reward calculation for outcome evaluation. During execution, the inputs and outputs of LLM calls are recorded as transitions and stored in a buffer, while post_process aggregates these transitions together with their rewards into formatted data compatible with multiple RL training backends such as SkyRL-train, VeRL, and Tinker. The dispatcher schedules jobs across the three stages according to predefined policies.
  • Figure 3: Examples of Supported Dispatching Methods. Async Batch is normally used for reasoning tasks where runtime initialization and reward computation are lightweight. Async Batch (Bounded) schedules trajectories sequentially with capped concurrency, leading to unbalanced GPU utilization across stages, but remains effective when runtime reset is inexpensive, such as in computer-use tasks. Async Pipeline overlaps the three stages to maintain high GPU utilization, suitable for tasks with expensive runtime or reward stages.
  • Figure 4: Illustration of an SWE Agent.
  • Figure 5: Training Curve for Deep Research Agent on SkyRL-train backend.
  • ...and 2 more figures
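
The three-stage rollout in Figure 2 and the Async Pipeline idea in Figure 3 can be illustrated with a minimal asyncio sketch. This is not SkyRL-Agent's actual API: the stage functions, the per-stage worker counts, the record layout, and the name `run_pipeline` are all hypothetical stand-ins, and the sleeps merely simulate work. The sketch only shows how separate worker pools per stage let runtime initialization, agent run, and reward calculation overlap across trajectories, and how recorded transitions might be combined with rewards afterwards.

```python
import asyncio

# Hypothetical stand-ins for the three rollout stages (sleeps simulate work).
async def init_runtime(task):
    await asyncio.sleep(0.01)                      # stage 1: runtime setup
    return {"task": task, "transitions": []}

async def agent_run(state):
    await asyncio.sleep(0.01)                      # stage 2: LLM calls + tool use
    # Record the (input, output) pair of an LLM call as a transition.
    state["transitions"].append((f"obs:{state['task']}", "action"))
    return state

async def compute_reward(state):
    await asyncio.sleep(0.01)                      # stage 3: outcome evaluation
    state["reward"] = 1.0
    return state

async def stage_worker(fn, inbox, outbox):
    # Pull a job, run this stage, hand the result to the next stage's queue.
    # Error handling is omitted: the stand-in stages above never raise.
    while True:
        item = await inbox.get()
        outbox.put_nowait(await fn(item))

async def run_pipeline(tasks, workers=(2, 4, 2)):
    """Overlap the three stages using one worker pool per stage (assumed sizes)."""
    q_init, q_run, q_reward, q_done = (asyncio.Queue() for _ in range(4))
    stages = [
        (init_runtime, q_init, q_run),
        (agent_run, q_run, q_reward),
        (compute_reward, q_reward, q_done),
    ]
    pool = [
        asyncio.create_task(stage_worker(fn, inbox, outbox))
        for (fn, inbox, outbox), n in zip(stages, workers)
        for _ in range(n)
    ]
    for t in tasks:
        q_init.put_nowait(t)
    results = [await q_done.get() for _ in tasks]  # completion order, not input order
    for w in pool:
        w.cancel()
    await asyncio.gather(*pool, return_exceptions=True)
    return results

def post_process(results):
    # Attach each trajectory's reward to its transitions (illustrative format).
    return [
        {"task": r["task"], "prompt": obs, "response": act, "reward": r["reward"]}
        for r in results
        for obs, act in r["transitions"]
    ]

if __name__ == "__main__":
    finished = asyncio.run(run_pipeline(list(range(5))))
    print(post_process(finished))
```

In this sketch, a bounded async-batch variant would instead wrap all three stages of a trajectory in a single concurrency cap, so slots stay occupied during initialization and reward computation; giving each stage its own pool is what allows the stages to overlap.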