An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training

Youshao Xiao, Zhenglei Zhou, Fagui Mao, Weichang Wu, Shangchun Zhao, Lin Ju, Lei Liang, Xiaolu Zhang, Jun Zhou

TL;DR

RLHF training suffers from inefficiencies when using a fixed Co-located placement across four interdependent models. The authors present FlexRLHF, comprising Interleaving and Disaggregated model placement strategies and a FlexRLHF Execution Engine to decouple training and inference runtimes, along with practical configuration guidelines. Empirical results show up to 11× throughput improvements over state-of-the-art frameworks (DeepSpeed-Chat and trlX) across model sizes and hardware heterogeneity. The work provides a concrete, adaptable framework and guidance to accelerate distributed RLHF training in real-world deployments.

Abstract

Recently, large language models (LLMs) such as ChatGPT and InstructGPT have made a significant impact in the AI world. Many works have attempted to reproduce InstructGPT's complex training pipeline, namely Reinforcement Learning with Human Feedback (RLHF). However, mainstream distributed RLHF training methods typically adopt a fixed model placement strategy, referred to as the Co-located strategy. This strategy treats all four interdependent models involved in RLHF as a single entity, distributing them across all devices and applying parallelism techniques designed for a single model, regardless of the workload heterogeneity inherent to each model. As a result, it exacerbates the generation bottleneck in RLHF training and degrades overall training efficiency. To address these issues, we propose a flexible model placement framework that offers two general and agile model placement strategies. The Interleaving strategy reduces the memory redundancy and communication costs of RLHF training by placing models that have no dependencies on each other on exclusive devices, with careful orchestration. The Disaggregated strategy improves training throughput by separating the training and inference runtimes of the RLHF pipeline, using additional shadow models. Furthermore, our framework provides a simple user interface and guidelines for configuring these strategies easily and flexibly in various training scenarios. Our experiments show that our strategies achieve improvements of up to 11× over current state-of-the-art (SOTA) approaches. The results highlight the effectiveness and adaptability of our methods in accelerating distributed RLHF training.
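The three placement strategies named in the abstract can be pictured as mappings from models to device groups. The following is a minimal sketch, not the FlexRLHF interface: the function names, the integer device ids, and the choice to keep the actor and critic on all devices under Interleaving are assumptions made for illustration. The Disaggregated mapping lists only the models the text names explicitly.

```python
from typing import Dict, List

# The four interdependent RLHF models.
MODELS = ["actor", "critic", "ref", "reward"]


def colocated(devices: List[int]) -> Dict[str, List[int]]:
    """Co-located: every model is distributed across every device."""
    return {m: list(devices) for m in MODELS}


def interleaving(devices: List[int]) -> Dict[str, List[int]]:
    """Interleaving: Ref and Reward do not depend on each other,
    so each one gets an exclusive group of devices."""
    if len(devices) < 2:
        raise ValueError("interleaving needs at least two devices")
    half = len(devices) // 2
    return {
        "actor": list(devices),    # assumption: trained models stay on all devices
        "critic": list(devices),   # assumption, as above
        "ref": list(devices[:half]),
        "reward": list(devices[half:]),
    }


def disaggregated(train: List[int], infer: List[int]) -> Dict[str, List[int]]:
    """Disaggregated: training models and shadow (inference) models
    live on exclusive device groups. Ref and Reward are left out here
    because their placement is not specified in the abstract."""
    return {
        "actor": list(train),
        "critic": list(train),
        "shadow_actor": list(infer),
        "shadow_critic": list(infer),
    }


def models_per_device(placement: Dict[str, List[int]]) -> Dict[int, int]:
    """Count how many models each device has to host."""
    counts: Dict[int, int] = {}
    for devs in placement.values():
        for d in devs:
            counts[d] = counts.get(d, 0) + 1
    return counts
```

With two devices, `models_per_device(colocated([0, 1]))` reports four models on each device, while `models_per_device(interleaving([0, 1]))` reports three, which is the reduction in redundancy the Interleaving strategy is aiming at.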

Paper Structure (21 sections, 9 figures, 5 tables, 2 algorithms)

Figures (9)

  • Figure 1: Workflow of RLHF and Percentage (in %) of Each Stage Duration.
  • Figure 2: The architecture of Model Placement Strategies, where (a) represents the Co-located strategy, (b) represents the Interleaving strategy, (c) demonstrates the Disaggregated strategy used for homogeneous devices, where generation and training models are assigned to exclusive groups of devices, and (d) showcases the Disaggregated strategy used for heterogeneous devices, where inference models are allocated to dedicated groups of devices specialized for inference.
  • Figure 3: The timeline of (a) the Co-located strategy vs. (b) the Interleaving strategy using two homogeneous GPU devices. "Worker 1" or "W1" denotes GPU device 1, and #=4 means that the number of micro-batches is 4. Under the Co-located strategy, both the Ref model and the Reward model are deployed on both devices, whereas under the Interleaving strategy, the Ref model and the Reward model are allocated to separate groups or devices. The efficiency improvement of our strategy is attributed to two reasons: i) Reduced memory redundancy: In the Co-located strategy (a), four models are allocated on both W1 and W2, while in the Interleaving strategy (b), the Ref model and the Reward model are placed on W1 and W2 exclusively. This reduces the memory redundancy of parallelism, e.g., low-level ZeRO parallelism, by reducing the number of participating devices for the Reward model or Ref model from 2 to 1. ii) Reduced communication cost: Because the Interleaving strategy assigns the Ref model and the Reward model to two separate GPU devices, it enables independent and parallel forward computation without communication between the two devices.
  • Figure 4: The timeline of the Disaggregated strategy on (a) homogeneous and (b) heterogeneous GPU devices. We set up shadow critic and shadow actor models and place them on separate GPU devices (W3 and W4 in (a); W4 and W6,7 in (b)), thereby decoupling the training and inference runtimes in the RLHF pipeline. Additionally, in the heterogeneous Disaggregated strategy in (b), these two stages are assigned to devices specialized for training or inference. As discussed in the paper's section on this strategy, the Disaggregated strategy can benefit from targeted optimization of each stage. This strategy also does not need to wait for the entire generation process to complete before proceeding: the Forward and Forward & Backward operations run in a pipelined manner, as shown in (b). For instance, during pipelined execution, the Forward tasks F1 and F2 can start immediately after the Generation tasks G1 and G2 complete, without being blocked by the subsequent Generation tasks G3 and G4.
  • Figure 5: The user interface of FlexRLHF Execution Engine.
  • ...and 4 more figures
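The pipelining effect described in the Figure 4 caption can be illustrated with a small timeline calculation. This is a simplified sketch under stated assumptions, not the paper's scheduler: it treats each micro-batch as one Generation task followed by one Forward task, runs the two stages on separate device groups, and uses made-up durations in arbitrary time units. The function names are hypothetical.

```python
from typing import List


def blocking_makespan(gen: List[float], fwd: List[float]) -> float:
    """No overlap: every Generation task finishes before any Forward task starts."""
    return sum(gen) + sum(fwd)


def pipelined_makespan(gen: List[float], fwd: List[float]) -> float:
    """Overlap: Forward task i starts as soon as Generation task i is done
    and the Forward stage has finished task i-1."""
    gen_done = 0.0  # time at which the Generation stage finishes its current task
    fwd_done = 0.0  # time at which the Forward stage finishes its current task
    for g, f in zip(gen, fwd):
        gen_done += g
        fwd_done = max(fwd_done, gen_done) + f
    return fwd_done


# Hypothetical durations for four micro-batches (G1..G4 and F1..F4).
gen_times = [2.0, 2.0, 2.0, 2.0]
fwd_times = [1.0, 1.0, 1.0, 1.0]

print(blocking_makespan(gen_times, fwd_times))   # 12.0
print(pipelined_makespan(gen_times, fwd_times))  # 9.0
```

In this toy setting only the last Forward task is not hidden behind generation, so the pipelined timeline ends one Forward duration after the final Generation task. The numbers show the overlap mechanism only and say nothing about measured speedups.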