Optimizing RLHF Training for Large Language Models with Stage Fusion
Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu, Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, Yibo Zhu, Xin Jin
TL;DR
RLHFuse tackles two core inefficiencies in RLHF training: long-tailed samples in the generation stage, which leave downstream stages idle, and pipeline bubbles in the training stage at large scale. It improves GPU utilization through two techniques: data-aware inter-stage fusion, which overlaps generation and inference at the sample level, and model-aware intra-stage fusion, which runs Actor and Critic micro-batches under a fused bi-directional pipeline schedule. The approach relies on a migration-threshold mechanism and simulated annealing-based schedule optimization, backed by production-oriented system optimizations. Empirical results on 13B–65B LLaMA models show throughput gains of up to 3.7x over state-of-the-art baselines, and the fused schedules come close to the lower bounds on latency and memory, indicating strong practical impact for scalable RLHF deployment.
Abstract
We present RLHFuse, an efficient training system with stage fusion for Reinforcement Learning from Human Feedback (RLHF). Existing RLHF systems suffer from low GPU utilization because of two intrinsic properties of RLHF training: data skewness in the generation stage and pipeline bubbles in the training stage. RLHFuse breaks with the traditional view of the RLHF workflow as a composition of individual tasks: it splits each task into finer-grained subtasks and performs stage fusion to improve GPU utilization. RLHFuse contains two key ideas. First, for generation and inference tasks, RLHFuse splits them into sample-level subtasks, enabling efficient inter-stage fusion that overlaps the execution of the generation and inference stages, thus mitigating the original generation bottleneck dominated by long-tailed samples. Second, for training tasks, RLHFuse breaks them into micro-batch subtasks and performs intra-stage fusion to execute these subtasks concurrently in the training stage with a fused pipeline schedule, effectively mitigating the pipeline bubbles. Experiments show that RLHFuse increases training throughput by up to $3.7\times$ compared to existing systems.
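To make the benefit of inter-stage fusion concrete, the following is a minimal toy model, not the RLHFuse implementation. It assumes that all samples generate in parallel, that a single inference worker processes one sample at a time at a fixed cost, and that this worker is available as soon as the first sample finishes (in the real system, such capacity would have to come from GPUs freed during generation). The function names and the numbers are hypothetical and chosen only for illustration.

```python
from typing import List


def stage_level_makespan(gen_times: List[float], infer_cost: float) -> float:
    """Inference starts only after the slowest sample has finished generating."""
    return max(gen_times) + infer_cost * len(gen_times)


def sample_level_makespan(gen_times: List[float], infer_cost: float) -> float:
    """Inference picks up each sample as soon as it has been generated.

    Toy assumption: one inference worker, samples served in completion order.
    """
    clock = 0.0
    for finish in sorted(gen_times):
        # The worker waits for the sample if it is not ready yet.
        clock = max(clock, finish) + infer_cost
    return clock


# Hypothetical long-tailed batch: four short samples and one very long one.
gen_times = [1.0, 1.0, 1.0, 1.0, 10.0]
infer_cost = 1.0

unfused = stage_level_makespan(gen_times, infer_cost)
fused = sample_level_makespan(gen_times, infer_cost)
print(f"stage-level: {unfused}, sample-level: {fused}")
```

In this toy setting, inference on the four short samples is hidden behind the generation of the long-tailed sample, so only the last sample's inference remains on the critical path. The model ignores resource contention between the two stages, which the actual system has to manage.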
