ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents

Hao Zhang, Mingjie Liu, Shaokun Zhang, Songyang Han, Jian Hu, Zhenghui Jin, Yuchi Zhang, Shizhe Diao, Ximing Lu, Binfeng Xu, Zhiding Yu, Jan Kautz, Yi Dong

Abstract

Multi-turn LLM agents are increasingly important for solving complex, interactive tasks, and reinforcement learning (RL) is a key ingredient for improving their long-horizon behavior. However, RL training requires generating large numbers of sandboxed rollout trajectories, and existing infrastructures often couple rollout orchestration with the training loop, making systems hard to migrate and maintain. Under the rollout-as-a-service philosophy, we present ProRL Agent, a scalable infrastructure that serves the full agentic rollout lifecycle through an API service. ProRL Agent also provides standardized and extensible sandbox environments that support diverse agentic tasks in rootless HPC settings. We validate ProRL Agent through RL training on software engineering, math, STEM, and coding tasks. ProRL Agent is open-sourced and integrated as part of NVIDIA NeMo Gym.

Paper Structure (23 sections, 1 equation, 11 figures, 3 tables)

Figures (11)

  • Figure 1: Coupled vs. decoupled designs. Left: Existing frameworks often embed the full agentic rollout lifecycle inside the RL training stack. Right: ProRL Agent treats rollout as an independent HTTP service. The trainer submits rollout requests and receives completed trajectories and rewards, while the rollout server handles environment execution, tool use, evaluation, and inference coordination. This decoupled design improves resource isolation, portability, and extensibility.
  • Figure 2: Overview of the ProRL Agent architecture. The system consists of three components. (1) Sandbox Environment: each rollout is executed inside a SingularityRuntime container and orchestrated via AgentHandler, which exposes three lifecycle methods, init(), run(), and eval(), for environment setup, multi-turn agent execution, and reward scoring, respectively. (2) ProRL Agent Server: an HTTP service that manages rollouts through a three-stage asynchronous pipeline (INIT $\to$ RUN $\to$ EVAL) with independent worker pools, and maintains a min-heap LLM backend pool supporting dynamic registration and checkpoint swapping. (3) RL Trainer: any training framework (e.g., veRL, NeMo RL) interacts with the server solely via HTTP, submitting jobs via POST process and managing backends via add_llm_server and /cancel; completed trajectories and rewards are returned to the trainer to update the policy.
  • Figure 3: Comparison of DAPO implementations ($n=4$). Our efficient implementation optimizes worker synchronization, significantly reducing the idle time (waiting period) between rollout generations compared to the baseline batch-by-batch approach.
  • Figure 4: Training curves for ProRL Agent across three agent domains. From left to right: mean reward during RL training of the STEM agent, Pass@1 on AMC during RL training of the math agent, and Pass@1 on Codeforces during RL training of the code agent. All three curves show steady improvement during training, demonstrating the generality of ProRL Agent beyond software engineering tasks.
  • Figure 5: Rollout throughput (instances/sec) on software engineering tasks versus the number of compute nodes. The near-linear increase in throughput demonstrates that ProRL Agent scales efficiently with additional compute resources.
  • ...and 6 more figures
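
The Figure 2 caption describes a three-stage asynchronous pipeline (INIT, RUN, EVAL) with independent worker pools and a min-heap LLM backend pool. The following is a minimal sketch of that idea, not ProRL Agent's actual implementation. It assumes that the heap is keyed on the number of in-flight requests per backend, which the caption does not state. All names here (Backend, BackendPool, stage_worker, run_pipeline, the example URLs) are hypothetical, and the stage bodies are placeholders for sandbox setup, multi-turn LLM calls, and reward scoring.

```python
import asyncio
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Backend:
    in_flight: int                     # heap key (assumed): active requests
    url: str = field(compare=False)    # hypothetical backend address


class BackendPool:
    """Least-loaded backend selection over a min-heap (illustrative only)."""

    def __init__(self):
        self._heap = []

    def register(self, url):
        # Dynamic registration: a new backend starts with zero load.
        heapq.heappush(self._heap, Backend(0, url))

    def acquire(self):
        backend = heapq.heappop(self._heap)   # fewest in-flight requests
        backend.in_flight += 1
        heapq.heappush(self._heap, backend)
        return backend

    def release(self, backend):
        backend.in_flight -= 1
        heapq.heapify(self._heap)             # key changed, restore heap order


async def stage_worker(step, inbox, outbox):
    # One worker of one stage: take a job, process it, hand it downstream.
    while True:
        job = await inbox.get()
        await outbox.put(await step(job))


async def run_pipeline(jobs, pool, workers_per_stage=2):
    async def init(job):                      # placeholder for environment setup
        job["env"] = "ready"
        return job

    async def run(job):                       # placeholder for multi-turn execution
        backend = pool.acquire()
        try:
            await asyncio.sleep(0)            # stand-in for LLM round trips
            job["backend"] = backend.url
        finally:
            pool.release(backend)
        return job

    async def evaluate(job):                  # placeholder for reward scoring
        job["reward"] = 1.0
        return job

    q_init, q_run, q_eval, q_done = (asyncio.Queue() for _ in range(4))
    stages = [(init, q_init, q_run), (run, q_run, q_eval), (evaluate, q_eval, q_done)]
    # Each stage gets its own pool of workers, so a slow stage does not block the others.
    tasks = [
        asyncio.create_task(stage_worker(step, inbox, outbox))
        for step, inbox, outbox in stages
        for _ in range(workers_per_stage)
    ]
    for job in jobs:
        q_init.put_nowait(job)
    results = [await q_done.get() for _ in jobs]
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results


if __name__ == "__main__":
    pool = BackendPool()
    pool.register("http://backend-a")         # hypothetical URLs
    pool.register("http://backend-b")
    done = asyncio.run(run_pipeline([{"id": i} for i in range(4)], pool))
    print(sorted(job["id"] for job in done))
```

Connecting the stages with queues lets each stage be sized independently, for example more RUN workers than INIT workers. In a real deployment the trainer would reach this pipeline over HTTP rather than by calling it in-process.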