RL in the Wild: Characterizing RLVR Training in LLM Deployment

Jiecheng Zhou; Qinghao Hu; Yuyang Jin; Zerui Wang; Peng Sun; Yuzhe Gu; Wenwei Zhang; Mingshu Zhai; Xingcheng Zhang; Weiming Zhang

TL;DR

This paper studies the system-level challenges of reinforcement learning with verifiable rewards (RLVR) in large language model (LLM) deployment through a workload-centric characterization of production RLVR tasks, and introduces PolyTrace, a trace-based benchmark suite. It analyzes diverse workloads (vision, math, tool-use, and other tasks) to reveal long-tail sequence lengths, dynamic performance, and load imbalance, then pairs these insights with a fine-grained system analysis of rollout, training, memory, and scaling bottlenecks. The authors provide a public trace dataset and show that PolyTrace can evaluate RL frameworks under realistic, varied workloads, reaching 94.7% accuracy in a practical use case. The work offers guidance on scheduling, data management, memory policies, and asynchronous training to improve the efficiency and scalability of RLVR systems.

Abstract

Large Language Models (LLMs) are now widely used across many domains. With their rapid development, Reinforcement Learning with Verifiable Rewards (RLVR) has surged in recent months as a way to enhance their reasoning and understanding abilities. However, its complex data flows and diverse tasks pose substantial challenges to RL training systems, and there is limited understanding of RLVR from a system perspective. To thoroughly understand the system challenges introduced by RLVR, we present a characterization study of RLVR tasks in our LLM deployment. Specifically, we investigate the distribution and variation trends of workloads across different RL tasks and training steps. We identify issues such as GPU idling caused by skewed sequence length distributions, inefficient parallel strategies under dynamically varying workloads, inefficient data management mechanisms, and load imbalance. We describe our observations and call for further investigation into the remaining open challenges. Furthermore, we propose the PolyTrace benchmark suite for evaluation with realistic workloads, and a practical use case validates that the PolyTrace benchmark suite achieves 94.7% accuracy.
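As a rough illustration of why a skewed sequence length distribution can leave GPUs idle, the sketch below computes the share of decoding slots wasted in a single rollout batch. It is not the authors' analysis and is not part of PolyTrace; the function name `wasted_slot_fraction` is hypothetical, and the model assumes simple static batching, where a batch only completes once its longest response has finished.

```python
from typing import Sequence


def wasted_slot_fraction(response_lengths: Sequence[int]) -> float:
    """Share of decoding slots left idle in one statically batched rollout.

    Assumption (illustrative only): every response in the batch occupies a
    slot until the longest response finishes, so the batch consumes
    max(length) * batch_size slot-steps, of which only sum(length) are useful.
    """
    if not response_lengths:
        raise ValueError("response_lengths must not be empty")
    longest = max(response_lengths)
    if longest <= 0:
        raise ValueError("response lengths must be positive")
    useful = sum(response_lengths)
    total = longest * len(response_lengths)
    return 1.0 - useful / total


# Made-up lengths: three short responses and one long-tail response.
skewed = [100, 100, 100, 1000]
# Made-up lengths: a perfectly uniform batch.
uniform = [500, 500, 500, 500]

print(f"skewed batch wasted fraction:  {wasted_slot_fraction(skewed):.3f}")
print(f"uniform batch wasted fraction: {wasted_slot_fraction(uniform):.3f}")
```

Under these assumptions, one long-tail response is enough to leave most of the batch's slots idle, while a uniform batch wastes none. Real RL frameworks may use continuous batching or other scheduling that changes this picture.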
