LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training

Bo Wu, Sid Wang, Yunhao Tang, Jia Ding, Eryk Helenowski, Liang Tan, Tengyu Xu, Tushar Gowda, Zhengxing Chen, Chen Zhu, Xiaocheng Tang, Yundi Qian, Beibei Zhu, Rui Hou

TL;DR

LlamaRL addresses the computational and memory bottlenecks of post-training RL for extremely large language models by introducing a fully distributed, asynchronous RL framework built in native PyTorch with a single controller. It combines co-located offloading, asynchronous off-policy RL (AIPO), fine-grained parallelism, quantization, and a GPU-native distributed weight synchronization method (DDMA) to decouple training and generation across independent GPU groups, achieving a formal speedup guarantee and practical gains. Empirically, it delivers up to $10.7\times$ speedup on a 405B-parameter policy and shows growing efficiency with model scale, while maintaining or improving policy quality on standard benchmarks; off-policy corrections are shown to stabilize training. Theoretical analysis supports that asynchronous design yields strictly better step times under realistic memory constraints, and the framework scales to thousands of GPUs, making large-scale RL for LLMs more feasible and extensible for future research and practice.

Abstract

Reinforcement Learning (RL) has become the most effective post-training approach for improving the capabilities of Large Language Models (LLMs). In practice, because of the high demands on latency and memory, it is particularly challenging to develop an efficient RL framework that reliably manages policy models with hundreds to thousands of billions of parameters. In this paper, we present LlamaRL, a fully distributed, asynchronous RL framework optimized for efficient training of large-scale LLMs with various model sizes (8B, 70B, and 405B parameters) on GPU clusters ranging from a handful to thousands of devices. LlamaRL introduces a streamlined, single-controller architecture built entirely on native PyTorch, enabling modularity, ease of use, and seamless scalability to thousands of GPUs. We also provide a theoretical analysis of LlamaRL's efficiency, including a formal proof that its asynchronous design leads to strict RL speed-up. Empirically during the Llama 3 post-training, by leveraging best practices such as colocated model offloading, asynchronous off-policy training, and distributed direct memory access for weight synchronization, LlamaRL achieves significant efficiency gains -- up to 10.7x speed-up compared to DeepSpeed-Chat-like systems on a 405B-parameter policy model. Furthermore, the efficiency advantage continues to grow with increasing model scale, demonstrating the framework's suitability for future large-scale RL training.

Paper Structure

This paper contains 45 sections, 5 theorems, 28 equations, 8 figures, 4 tables, 2 algorithms.

Key Result

Theorem 7.1

Given the same hardware budget and memory constraints, LlamaRL can be configured to perform reinforcement learning (RL) strictly faster than any possible configuration of a traditional synchronous RL framework.
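The intuition behind this statement can be illustrated with a toy step-time model. This is only a sketch under stated assumptions, not the paper's proof: the cost function `per_sample_time`, all numeric constants, and the batch sizes are hypothetical. It assumes that per-sample time falls as batch size grows, and that a co-located synchronous setup is limited to a smaller batch size than dedicated generator and trainer groups because both roles share GPU memory.

```python
def per_sample_time(base, overhead, batch_size):
    """Hypothetical cost model: per-sample time shrinks as batch size grows."""
    return base + overhead / batch_size


def sync_step_time(n_samples, n_gpus, eta_g, eta_t):
    """Synchronous RL: all GPUs generate, then all GPUs train, back to back."""
    return n_samples * (eta_g + eta_t) / n_gpus


def async_step_time(n_samples, n_gen, n_train, eta_g, eta_t):
    """Asynchronous RL: disjoint GPU groups run in parallel, so the step
    time is bounded by the slower of the two groups."""
    return max(n_samples * eta_g / n_gen, n_samples * eta_t / n_train)


def best_async_split(n_samples, n_gpus, eta_g, eta_t):
    """Try every generator/trainer split and return (n_gen, step_time)."""
    candidates = (
        (k, async_step_time(n_samples, k, n_gpus - k, eta_g, eta_t))
        for k in range(1, n_gpus)
    )
    return min(candidates, key=lambda pair: pair[1])


# Hypothetical numbers: 1024 samples per step on 64 GPUs.
N_SAMPLES, N_GPUS = 1024, 64
SYNC_BATCH, ASYNC_BATCH = 4, 16  # assumed memory-limited batch sizes

sync_time = sync_step_time(
    N_SAMPLES, N_GPUS,
    eta_g=per_sample_time(1.0, 4.0, SYNC_BATCH),
    eta_t=per_sample_time(0.5, 2.0, SYNC_BATCH),
)
n_gen, async_time = best_async_split(
    N_SAMPLES, N_GPUS,
    eta_g=per_sample_time(1.0, 4.0, ASYNC_BATCH),
    eta_t=per_sample_time(0.5, 2.0, ASYNC_BATCH),
)
print(f"sync: {sync_time:.2f}, async: {async_time:.2f} with {n_gen} generator GPUs")
```

In this toy model the asynchronous gain comes entirely from the larger per-group batch size; with identical batch sizes the best split would only match the synchronous step time, which is why the memory constraint matters to the argument.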

Figures (8)

  • Figure 1: An example flow of online RL training. The reference model used for regularization is omitted from this flow for simplicity. The example flow also forgoes a learned critic model and instead estimates the baseline from group scores to compute the advantages for the policy update. The example uses rule-based scorers, as is often the case for code and reasoning applications. The policy model has two instances: one implemented with Fully Sharded Data Parallel (FSDP) for training, and one using CUDA Graph for inference.
  • Figure 2: Process demonstrations of (a) synchronous on-policy RL, and (b) asynchronous off-policy RL. For asynchronous RL, the generator and trainer run in parallel without blocking one another, in contrast to synchronous RL. This design accelerates the overall training process significantly, without compromising model quality.
  • Figure 3: An example of LlamaRL architecture with a communication channel between two executors, managed by a single controller.
  • Figure 4: Model weights synchronization via distributed direct memory access (DDMA).
  • Figure 5: Empirical verification of the assumption on batch size scaling, using the 70B model. Left: Training time per 128 samples decreases with increasing microbatch size. Right: Generation time per 64 completions decreases with increasing maximum decoding concurrency. Both results illustrate sub-linear growth in total processing time, supporting the assumption that per-sample time ($\eta_t$, $\eta_g$) decreases with batch size.
  • ...and 3 more figures
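
The critic-free baseline mentioned in the Figure 1 caption can be sketched as follows. This is a minimal illustration, assuming the baseline is the plain mean of the scores for all completions sampled from the same prompt; the function name `group_baseline_advantages` is hypothetical, and the paper's exact estimator (for example, any further normalization) may differ.

```python
from statistics import mean


def group_baseline_advantages(scores):
    """Advantage of each completion relative to its group's mean score.

    `scores` holds the scorer outputs for all completions sampled from
    one prompt. The group mean stands in for a learned critic's baseline.
    """
    baseline = mean(scores)
    return [score - baseline for score in scores]


# Four completions for one prompt, scored by a rule-based checker (1 = pass).
advantages = group_baseline_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Completions that score above the group mean receive a positive advantage and the rest a negative one, so no separate value network is needed to rank them.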

Theorems & Definitions (14)

  • Theorem 7.1: Theoretical speed-up of LlamaRL, informal version of Theorem 7.5
  • Definition 7.2
  • Definition 7.3: Processing time
  • Definition 7.4: RL step time
  • Theorem 7.5: Theoretical speed-up of LlamaRL
  • Remark 7.1
  • proof
  • Remark 7.2
  • Remark 7.3
  • Lemma B.1
  • ...and 4 more