EARL: Efficient Agentic Reinforcement Learning Systems for Large Language Models

Zheyue Tan, Mustapha Abdullahi, Tuo Shi, Huining Yuan, Zelai Xu, Chao Yu, Boxun Li, Bo Zhao

TL;DR

Context-length growth in agentic RL for LLMs creates memory and interconnect bottlenecks that constrain scalability. EARL introduces a Parallelism Selector that adapts tensor parallelism across RL stages and a Data Dispatcher that performs layout-aware, decentralized data transfers in place of a centralized all-gather. The approach scales training to thousands of GPUs and lifts context-length limitations by reducing out-of-memory (OOM) risk and inter-stage latency; it is demonstrated on a 16-machine cluster with Qwen2.5-72B-Instruct playing Connect Four, where it achieves substantial throughput gains. The work offers practical system-level improvements for scalable agentic RL in real-world deployments and opens paths to more capable, tool-using LLM agents.

Abstract

Reinforcement learning (RL) has become a pivotal component of large language model (LLM) post-training, and agentic RL extends this paradigm so that LLMs operate as agents through multi-turn interaction and tool use. Scaling such systems exposes two practical bottlenecks: (1) context length grows rapidly during training, inflating memory usage and latency and triggering out-of-memory (OOM) failures; and (2) intermediate tensors accumulate with context length, making cross-device data movement a major system bottleneck. We present EARL, a scalable system for efficient agentic RL. EARL introduces a parallelism selector that dynamically adapts model and training parallelism across RL stages based on sequence length and system load, and a data dispatcher that performs layout-aware, decentralized exchange of intermediate data batches. Together, these components increase throughput, reduce long-context failures, and enable stable large-scale training of agentic LLMs without relying on hard limits or penalties on context length.
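
As a rough illustration of the selector idea described above, the following Python sketch picks a tensor-parallel (TP) degree from a profiled throughput table and skips configurations marked as running out of memory. The table `PROFILE`, the function name `select_tp`, the bucketing scheme, and every number are hypothetical placeholders. None of them is EARL's implementation or a measurement from the paper.

```python
from bisect import bisect_left

# Hypothetical profile (assumption, not from the paper): for each context-length
# bucket (upper bound in tokens), throughput in tokens/s per TP degree.
# None marks a configuration assumed to hit OOM. All values are made up.
PROFILE = {
    8_192: {4: 1000.0, 8: 900.0},
    16_384: {4: 600.0, 8: 700.0},
    32_768: {4: None, 8: 400.0},
}


def select_tp(context_len: int, profile=PROFILE) -> int:
    """Return the TP degree with the highest profiled throughput for the
    smallest bucket whose upper bound is >= context_len, ignoring OOM entries."""
    bounds = sorted(profile)
    idx = bisect_left(bounds, context_len)
    if idx == len(bounds):
        raise ValueError(f"context length {context_len} exceeds profiled range")
    feasible = {tp: thr for tp, thr in profile[bounds[idx]].items() if thr is not None}
    if not feasible:
        raise RuntimeError("no feasible TP degree for this bucket")
    return max(feasible, key=feasible.get)
```

With this placeholder table, short contexts resolve to TP=4 and longer ones to TP=8. A real selector would also account for system load, which this sketch omits.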

Paper Structure

This paper contains 10 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: Training a 4B-parameter LLM on the Tic-Tac-Toe task: (a) turn-level context length steadily increases; (b) episode-level context length quickly reaches the system limit; and (c) model performance collapses due to context truncation.
  • Figure 2: System design of EARL.
  • Figure 3: Relative throughput speedup from $TP=4$ to $TP=8$ across different context lengths and response counts, computed using Equation 1. Positive values indicate TP8 outperforms TP4; negative values indicate TP4 outperforms TP8.
  • Figure 4: Data dispatch latency of the baseline and EARL under different context lengths. Numbers above the bars indicate the relative latency reduction of EARL compared to the baseline.
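
The sign convention in the Figure 3 caption is consistent with a relative throughput difference. The speedup equation itself is not reproduced in this summary, so the formula used below, $(T_{TP8} - T_{TP4}) / T_{TP4}$, is an assumption. The function name `relative_speedup` and the sample throughputs in the usage note are likewise hypothetical.

```python
def relative_speedup(throughput_tp4: float, throughput_tp8: float) -> float:
    """Assumed definition: relative change in throughput when moving from
    TP=4 to TP=8. Positive means TP8 is faster; negative means TP4 is faster."""
    if throughput_tp4 <= 0:
        raise ValueError("baseline throughput must be positive")
    return (throughput_tp8 - throughput_tp4) / throughput_tp4
```

For example, with made-up throughputs of 100 and 125 tokens/s the function reports a positive value (TP8 ahead), while 100 and 80 tokens/s yields a negative value (TP4 ahead).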