CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models

Zhihang Lin; Mingbao Lin; Yuan Xie; Rongrong Ji

TL;DR

This work tackles the high training cost of GRPO-based reasoning models, which stems from sampling many completions per prompt. It introduces Completion Pruning Policy Optimization (CPPO), which prunes completions whose absolute advantage $|A_i|$ falls below a threshold $\gamma$, reducing the forward passes and gradient computations needed for each update. A dynamic completion allocation strategy further improves GPU utilization by filling the freed capacity with high-value completions from new questions. Experiments show up to $7.98\times$ speedup on GSM8K and $3.48\times$ on MATH while preserving or improving accuracy. CPPO also generalizes to other RL algorithms and backbones, offering a practical path to scalable, efficient training of reasoning models.

Abstract

This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO, while effective, incurs high training costs because it must sample multiple completions for each question. Our experiments and theoretical analysis reveal two things: the number of completions affects model accuracy yet increases training time multiplicatively, and not all completions contribute equally to policy training, since their contribution depends on their relative advantage. To address these issues, we propose CPPO, which prunes completions with low absolute advantages, significantly reducing the number needed for gradient calculation and updates. Additionally, we introduce a dynamic completion allocation strategy that maximizes GPU utilization by incorporating additional questions, further enhancing training efficiency. Experiments show that CPPO achieves up to $7.98\times$ speedup on GSM8K and $3.48\times$ on MATH while preserving or even enhancing accuracy compared to the original GRPO. We release our code at https://github.com/lzhxmu/CPPO.
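To make the pruning idea concrete, here is a minimal sketch, not the authors' implementation. It assumes the usual GRPO-style advantage (each reward normalized by its group's mean and standard deviation) and a fixed threshold `gamma` on $|A_i|$. The function names, the use of population standard deviation, the `eps` stabilizer, and the `>=` comparison are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward within its group.

    Assumption: population std plus a small eps to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def prune_by_abs_advantage(advantages, gamma):
    """Return indices of completions whose |A_i| is at least gamma.

    Only these completions would go on to the forward/backward pass;
    the rest are dropped before gradient computation.
    """
    return [i for i, a in enumerate(advantages) if abs(a) >= gamma]


# Hypothetical group of 8 completions for one question: one correct answer.
rewards = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
kept = prune_by_abs_advantage(advantages, gamma=1.0)
print(kept)
```

In this toy group only the single correct completion has a large absolute advantage, so it alone survives pruning. The seven near-average completions are skipped, which is where the savings in gradient computation would come from.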
