Optimizing RLHF Training for Large Language Models with Stage Fusion

Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu, Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, Yibo Zhu, Xin Jin

TL;DR

RLHFuse tackles two core inefficiencies in RLHF training: long-tail generation causing downstream idle time and pipeline bubbles in large-scale training. By introducing data-aware inter-stage fusion (sample-level overlap between generation and inference) and model-aware intra-stage fusion (fused bi-directional schedules for Actor and Critic micro-batches), it significantly improves GPU utilization. The approach relies on a migration-threshold mechanism and simulated annealing-based schedule optimization, backed by production-oriented system optimizations. Empirical results on 13B–65B LLaMA models show throughput gains up to 3.7x over state-of-the-art baselines, along with close-to-lower-bound latency and memory performance in fused schedules, indicating strong practical impact for scalable RLHF deployment.

Abstract

We present RLHFuse, an efficient training system with stage fusion for Reinforcement Learning from Human Feedback (RLHF). Due to the intrinsic nature of RLHF training, i.e., the data skewness in the generation stage and the pipeline bubbles in the training stage, existing RLHF systems suffer from low GPU utilization. RLHFuse breaks the traditional view of the RLHF workflow as a composition of individual tasks, splitting each task into finer-grained subtasks and performing stage fusion to improve GPU utilization. RLHFuse contains two key ideas. First, for generation and inference tasks, RLHFuse splits them into sample-level subtasks, enabling efficient inter-stage fusion that overlaps the execution of the generation and inference stages, thus mitigating the original generation bottleneck dominated by long-tailed samples. Second, for training tasks, RLHFuse breaks them into micro-batch subtasks and performs intra-stage fusion to execute these subtasks concurrently in the training stage with a fused pipeline schedule, effectively mitigating the pipeline bubbles. The experiments show that RLHFuse increases the training throughput by up to $3.7\times$ compared to existing systems.
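
As a rough illustration of why sample-level overlap helps when output lengths are long-tailed, the toy model below compares serial execution with a fused schedule. It is a sketch under simplifying assumptions, not RLHFuse's implementation: the function names, the single FIFO inference worker, the fixed per-token `step_time`, and the way `threshold` decides when inference resources become free are all hypothetical.

```python
from typing import Sequence


def serial_makespan(lengths: Sequence[int], step_time: float, infer_time: float) -> float:
    """Generation must finish for every sample before any inference starts."""
    generation_done = max(lengths) * step_time
    return generation_done + len(lengths) * infer_time


def fused_makespan(lengths: Sequence[int], step_time: float,
                   infer_time: float, threshold: int) -> float:
    """Inference starts once at most `threshold` samples are still generating.

    Assumption: a sample of length L finishes generating at L * step_time,
    and one inference worker handles finished samples in completion order.
    """
    finish_times = sorted(length * step_time for length in lengths)
    # Index of the sample whose completion leaves `threshold` samples unfinished.
    start_index = max(len(finish_times) - threshold - 1, 0)
    clock = finish_times[start_index]
    for finish in finish_times:
        # A sample can be scored only after it has been fully generated.
        clock = max(clock, finish) + infer_time
    return clock


if __name__ == "__main__":
    # Seven short samples and one long-tailed sample.
    lengths = [10] * 7 + [100]
    print(serial_makespan(lengths, step_time=1.0, infer_time=2.0))              # 116.0
    print(fused_makespan(lengths, step_time=1.0, infer_time=2.0, threshold=1))  # 102.0
```

With `threshold=0` the fused model degenerates to the serial one, which makes the benefit of starting inference before the tail finishes easy to isolate. The model ignores migration cost and any slowdown of the remaining long samples, both of which a real system has to account for.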

Paper Structure (19 sections, 10 figures, 3 tables, 3 algorithms)

This paper contains 19 sections, 10 figures, 3 tables, 3 algorithms.

Figures (10)

  • Figure 1: RLHF workflow (Ouyang et al., 2022) and the problems in the generation and training stages of existing RLHF training frameworks.
  • Figure 2: Left: The output length CDF of models in the LMSYS-Chat-1M dataset. The vertical dotted line indicates the P99.9 output length. Right: The RLHF training iteration breakdown on the internal model and datasets under different maximum output lengths.
  • Figure 3: The timeline of the 1F1B (PipeDream) and interleaved 1F1B (Megatron-LM, Shoeybi et al., 2020) pipeline schedules with 4 pipeline stages and 4 micro-batches.
  • Figure 4: RLHFuse architecture.
  • Figure 5: The timeline of serial (top) and fused (bottom) execution of generation and inference stages.
  • ...and 5 more figures
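
For the configuration named in Figure 3 (4 pipeline stages, 4 micro-batches), the idle share that intra-stage fusion targets can be estimated with the standard bubble formula for 1F1B-style schedules. The helper below is an illustrative sketch, not code from the paper: `bubble_fraction` and its parameter names are hypothetical, and it assumes every micro-batch takes the same forward and backward time on every stage.

```python
def bubble_fraction(stages: int, micro_batches: int, chunks: int = 1) -> float:
    """Return the idle share of one pipeline-parallel training iteration.

    chunks=1 models plain 1F1B; chunks>1 models interleaved 1F1B with that
    many model chunks per device, which shrinks the warm-up/cool-down bubble.
    """
    if stages < 1 or micro_batches < 1 or chunks < 1:
        raise ValueError("stages, micro_batches, and chunks must be positive")
    warmup_slots = stages - 1
    # Useful work scales with chunks * micro_batches; the bubble stays at stages - 1 slots.
    return warmup_slots / (chunks * micro_batches + warmup_slots)


if __name__ == "__main__":
    print(f"{bubble_fraction(4, 4):.3f}")            # 0.429 for plain 1F1B
    print(f"{bubble_fraction(4, 4, chunks=2):.3f}")  # 0.273 with two chunks per device
```

Under these assumptions, roughly 43% of each device's time is idle in the plain 1F1B case, which is the kind of gap a fused schedule tries to fill with micro-batches from a second model.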