LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training
Bo Wu, Sid Wang, Yunhao Tang, Jia Ding, Eryk Helenowski, Liang Tan, Tengyu Xu, Tushar Gowda, Zhengxing Chen, Chen Zhu, Xiaocheng Tang, Yundi Qian, Beibei Zhu, Rui Hou
TL;DR
LlamaRL addresses the computational and memory bottlenecks of post-training RL for extremely large language models by introducing a fully distributed, asynchronous RL framework built in native PyTorch with a single controller. It combines co-located offloading, asynchronous off-policy RL (AIPO), fine-grained parallelism, quantization, and a GPU-native distributed weight synchronization method (DDMA) to decouple training and generation across independent GPU groups, and it comes with a formal speed-up guarantee alongside practical gains. Empirically, it delivers up to a $10.7\times$ speed-up on a 405B-parameter policy and shows growing efficiency with model scale, while maintaining or improving policy quality on standard benchmarks; off-policy corrections are shown to stabilize training. The theoretical analysis shows that the asynchronous design yields strictly better step times under realistic memory constraints, and the framework scales to thousands of GPUs, making large-scale RL for LLMs more feasible and extensible for future research and practice.
Abstract
Reinforcement Learning (RL) has become the most effective post-training approach for improving the capabilities of Large Language Models (LLMs). In practice, because of the high demands on latency and memory, it is particularly challenging to develop an efficient RL framework that reliably manages policy models with hundreds to thousands of billions of parameters. In this paper, we present LlamaRL, a fully distributed, asynchronous RL framework optimized for efficient training of large-scale LLMs with various model sizes (8B, 70B, and 405B parameters) on GPU clusters ranging from a handful to thousands of devices. LlamaRL introduces a streamlined, single-controller architecture built entirely on native PyTorch, enabling modularity, ease of use, and seamless scalability to thousands of GPUs. We also provide a theoretical analysis of LlamaRL's efficiency, including a formal proof that its asynchronous design leads to strict RL speed-up. Empirically, during Llama 3 post-training, LlamaRL leverages best practices such as colocated model offloading, asynchronous off-policy training, and distributed direct memory access for weight synchronization to achieve significant efficiency gains: up to a 10.7x speed-up compared to DeepSpeed-Chat-like systems on a 405B-parameter policy model. Furthermore, the efficiency advantage continues to grow with increasing model scale, demonstrating the framework's suitability for future large-scale RL training.
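To give some intuition for why decoupling generation from training can shorten a step, here is a minimal toy model in Python. It is not the paper's analysis: it assumes that a synchronous step costs generation time plus training time, that an asynchronous step in steady state costs only the slower of the two, and that both times are fixed illustrative numbers. The function names are hypothetical. In this toy model the speed-up is capped at 2x, so it does not account for the reported 10.7x figure, which the abstract attributes to several techniques used together.

```python
def sync_step_time(t_gen: float, t_train: float) -> float:
    """Toy assumption: generation and training run one after the other."""
    return t_gen + t_train


def async_step_time(t_gen: float, t_train: float) -> float:
    """Toy assumption: generation and training overlap on separate GPU
    groups, so a steady-state step is bounded by the slower stage."""
    return max(t_gen, t_train)


def speedup(t_gen: float, t_train: float) -> float:
    """Ratio of synchronous to asynchronous step time in the toy model."""
    if t_gen <= 0 or t_train <= 0:
        raise ValueError("stage times must be positive")
    return sync_step_time(t_gen, t_train) / async_step_time(t_gen, t_train)


if __name__ == "__main__":
    # Illustrative stage times in seconds; these are made-up values.
    for t_gen, t_train in [(3.0, 1.0), (2.0, 2.0), (1.0, 4.0)]:
        print(f"gen={t_gen}s train={t_train}s -> {speedup(t_gen, t_train):.2f}x")
```

Whenever both stage times are positive, the ratio is strictly greater than 1, and it is largest when the two stages are balanced. In a real system, splitting GPUs between generation and training also changes each stage's time, which this sketch ignores.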
