OPPO: Accelerating PPO-based RLHF via Pipeline Overlap

Kaizhuo Yan, Yingjie Yu, Yifan Yu, Haizhong Zheng, Fan Lai

TL;DR

OPPO addresses inefficiencies in PPO-based RLHF caused by sequential multi-model dependencies and long-tail response latency. It introduces intra-step overlap by streaming actor outputs to downstream reward models and inter-step overlap by overcommitting a few prompts to mask tail latency, both with dynamic control to balance efficiency and convergence. Across multiple tasks and model scales, OPPO achieves 1.8x–2.8x training speedups and 1.4x–2.1x higher GPU utilization while preserving convergence and final accuracy. This lightweight, model-agnostic approach offers a practical path to scalable RLHF for large language models, compatible with existing PPO frameworks and adaptable to related RLHF optimization strategies.

Abstract

Proximal Policy Optimization (PPO)-based reinforcement learning from human feedback (RLHF) is a widely adopted paradigm for aligning large language models (LLMs) with human preferences. However, its training pipeline suffers from substantial inefficiencies due to sequential multi-model dependencies (e.g., the reward model depends on actor outputs) and long-tail response lengths, where a few long responses delay completion of the whole stage. We present OPPO, a novel, lightweight, and model-agnostic PPO-based RLHF framework that improves training efficiency by overlapping pipeline execution. OPPO introduces two novel techniques: (1) Intra-step overlap, which streams upstream model outputs (e.g., from the actor model) in right-sized chunks, enabling the downstream model (e.g., the reward model) to begin prefill while the upstream model continues decoding; and (2) Inter-step overlap, which adaptively overcommits a few prompts and defers long generations to future steps, mitigating tail latency without discarding partial work. OPPO integrates easily with existing PPO implementations, requiring only a few lines of code changes. Extensive evaluations show that OPPO accelerates PPO-based RLHF training by $1.8\times$--$2.8\times$ and improves GPU utilization by $1.4\times$--$2.1\times$ without compromising training convergence.
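
The inter-step overlap idea can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `run_steps`, the one-token-per-tick decoding model, and the fixed overcommitment degree are all hypothetical simplifications (the paper adapts the degree dynamically, which is not modeled here). Each step keeps `batch_size + overcommit` sequences in flight, ends once `batch_size` of them have finished, and carries the unfinished ones forward with their partial progress.

```python
from collections import deque


def run_steps(lengths, batch_size, overcommit):
    """Toy model of inter-step overlap (hypothetical helper, not OPPO's API).

    lengths: positive response length in tokens for each prompt, in order.
    Every in-flight sequence decodes one token per tick. A step ends as soon
    as batch_size sequences have finished; sequences that finish in the same
    tick are all collected, so a step may slightly exceed batch_size.
    Returns a list of (ticks, finished_prompt_ids), one entry per step.
    """
    pending = deque(enumerate(lengths))  # prompts not yet started
    in_flight = {}                       # prompt id -> tokens left to decode
    steps = []
    while pending or in_flight:
        # Top up the in-flight set to batch_size + overcommit sequences.
        while pending and len(in_flight) < batch_size + overcommit:
            pid, remaining = pending.popleft()
            in_flight[pid] = remaining
        target = min(batch_size, len(in_flight))
        ticks, finished = 0, []
        while len(finished) < target:
            ticks += 1
            for pid in list(in_flight):
                in_flight[pid] -= 1
                if in_flight[pid] == 0:
                    finished.append(pid)
                    del in_flight[pid]
        # Unfinished sequences stay in in_flight and keep their progress.
        steps.append((ticks, finished))
    return steps


lengths = [2, 3, 10, 2, 4, 2, 3, 2]  # one long-tail response (10 tokens)
baseline = run_steps(lengths, batch_size=4, overcommit=0)
overlapped = run_steps(lengths, batch_size=4, overcommit=2)
print(sum(t for t, _ in baseline))    # 14 ticks: step 1 waits for the straggler
print(sum(t for t, _ in overlapped))  # 10 ticks: the straggler is deferred
```

In this toy run, the first overlapped step finishes after 3 ticks instead of 10 because the 10-token response is carried into the next step rather than blocking the batch. The final step still waits for the straggler, which is an artifact of the prompt list running out in such a short example.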

Paper Structure

This paper contains 37 sections, 5 equations, 7 figures, 1 table, 1 algorithm.

Figures (7)

  • Figure 1: (a) In the existing paradigm, the scoring stage does not start processing a response until that response is fully generated. In contrast, (b) the OPPO paradigm interleaves scoring with generation without altering the final responses (intra-step overlap), and carries unfinished overcommitted sequences into the next iteration (inter-step overlap). The illustrations use a batch size of 4 and an overcommitment degree of 2.
  • Figure 2: PPO-based RLHF faces (a) varying resource demands across pipeline stages and (b) varying response lengths across rollouts, both of which can produce stragglers that prolong step execution. (c) Existing approaches for asynchronous training risk harming convergence.
  • Figure 3: OPPO improves PPO-based RLHF training efficiency by $1.8\times$--$2.8\times$ over TRL across datasets, enabled by overlapping actor generation with reward scoring and early stopping.
  • Figure 4: OPPO achieves efficiency gains without affecting training quality.
  • Figure 5: OPPO improves GPU utilization in the inference stage by $1.4\times$--$2.1\times$, enabling more efficient compute use by overlapping actor generation with reward scoring.
  • ...and 2 more figures