OPPO: Accelerating PPO-based RLHF via Pipeline Overlap
Kaizhuo Yan, Yingjie Yu, Yifan Yu, Haizhong Zheng, Fan Lai
TL;DR
OPPO addresses inefficiencies in PPO-based RLHF caused by sequential multi-model dependencies and long-tail response latency. It introduces intra-step overlap by streaming actor outputs to downstream reward models and inter-step overlap by overcommitting a few prompts to mask tail latency, both with dynamic control to balance efficiency and convergence. Across multiple tasks and model scales, OPPO achieves 1.8x–2.8x training speedups and 1.4x–2.1x higher GPU utilization while preserving convergence and final accuracy. This lightweight, model-agnostic approach offers a practical path to scalable RLHF for large language models, compatible with existing PPO frameworks and adaptable to related RLHF optimization strategies.
Abstract
Proximal Policy Optimization (PPO)-based reinforcement learning from human feedback (RLHF) is a widely adopted paradigm for aligning large language models (LLMs) with human preferences. However, its training pipeline suffers from substantial inefficiencies due to sequential multi-model dependencies (e.g., the reward model depends on actor outputs) and long-tail response lengths, where a few long responses delay stage completion. We present OPPO, a novel, lightweight, and model-agnostic PPO-based RLHF framework that improves training efficiency by overlapping pipeline execution. OPPO introduces two techniques: (1) Intra-step overlap, which streams upstream model outputs (e.g., from the actor model) in right-sized chunks, enabling the downstream model (e.g., the reward model) to begin prefill while the upstream model continues decoding; and (2) Inter-step overlap, which adaptively overcommits a few prompts and defers long generations to future steps, mitigating tail latency without discarding partial work. OPPO integrates easily with existing PPO implementations, requiring only a few lines of code changes. Extensive evaluations show that OPPO accelerates PPO-based RLHF training by $1.8\times$–$2.8\times$ and improves GPU utilization by $1.4\times$–$2.1\times$ without compromising training convergence.
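To make the two overlap ideas concrete, the following is a minimal toy sketch in Python, not the authors' implementation. It assumes a simplified cost model in which each in-flight response decodes one token per "tick", and it ignores GPU scheduling, adaptive chunk sizing, and the dynamic control of the overcommit level. The names `stream_chunks`, `IncrementalScorer`, and `inter_step_overlap` are hypothetical and exist only for illustration.

```python
from collections import deque
from typing import Dict, Iterator, List, Tuple


def stream_chunks(tokens: List[int], chunk_size: int) -> Iterator[List[int]]:
    """Intra-step overlap: yield actor output in chunks instead of all at once."""
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]


class IncrementalScorer:
    """Hypothetical stand-in for a reward model that prefills chunk by chunk."""

    def __init__(self) -> None:
        self.tokens_seen = 0
        self.running_sum = 0

    def prefill(self, chunk: List[int]) -> None:
        # Consume each chunk on arrival rather than waiting for the full response.
        self.tokens_seen += len(chunk)
        self.running_sum += sum(chunk)

    def score(self) -> float:
        # Toy reward (an assumption): mean token id over everything prefilled.
        return self.running_sum / max(self.tokens_seen, 1)


def inter_step_overlap(
    lengths: Dict[str, int], batch_size: int, overcommit: int
) -> List[Tuple[List[str], int]]:
    """Inter-step overlap: keep batch_size + overcommit prompts in flight,
    close a step once batch_size responses finish, and carry the rest over
    with their partial progress. Assumes batch_size >= 1.
    Returns one (finished prompt ids, ticks used) pair per step."""
    queue = deque(lengths)          # prompt ids not yet started
    progress: Dict[str, int] = {}   # tokens decoded so far per in-flight prompt
    history: List[Tuple[List[str], int]] = []
    while queue or progress:
        # Top up the in-flight pool, including the overcommitted prompts.
        while queue and len(progress) < batch_size + overcommit:
            progress[queue.popleft()] = 0
        target = min(batch_size, len(progress))
        ticks = 0
        # Decode one token per tick for every unfinished in-flight prompt.
        while sum(progress[p] >= lengths[p] for p in progress) < target:
            ticks += 1
            for p in progress:
                if progress[p] < lengths[p]:
                    progress[p] += 1
        batch = [p for p in progress if progress[p] >= lengths[p]][:target]
        for p in batch:
            del progress[p]         # unfinished prompts stay, progress intact
        history.append((batch, ticks))
    return history
```

For example, with response lengths `{"a": 2, "b": 3, "c": 10, "d": 2, "e": 3}` and `batch_size=2`, the toy model needs 16 ticks in total when `overcommit=0` (the long response `c` stalls its step) and 10 ticks when `overcommit=1` (`c` is deferred and finishes in a later step). These numbers describe only this toy cost model and say nothing about real training speedups.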
