GCPO: When Contrast Fails, Go Gold
Hao Wu, Wei Liu
TL;DR
The paper tackles sample inefficiency and update-direction failures in reinforcement-learning-based post-training for reasoning in smaller LLMs. It introduces GCPO, which injects a golden answer (GA), that is, an external reference answer, when all rollouts fail, and uses sequence-level importance sampling to align rewards with policy updates. By removing KL penalties and drawing the GA from ground-truth data or a larger model, GCPO speeds convergence and improves generalization, enabling smaller models to emulate the reasoning strategies of larger models on math benchmarks. Empirical results show substantial gains over DAPO and the baseline model across multiple datasets, with notable improvements on AIME-2024 and MQA, and an ablation study highlights the value of GA guidance and sequence-level importance sampling. The work advances data-efficient, reasoning-focused RL post-training and offers a practical path toward broader application, including tool-using Chain-of-Thought scenarios, with code available at the provided repository.
Abstract
Reinforcement learning has been widely applied to enhance the reasoning capabilities of large language models. Extending the inference limits of smaller models has become a prominent research focus. However, algorithms such as Group Relative Policy Optimization (GRPO) suffer from a clear drawback: the upper bound of a model's rollout responses is entirely determined by the model itself, preventing the acquisition of knowledge from samples that are either all incorrect or all correct. In this paper, we introduce Group Contrastive Policy Optimization (GCPO), a method that incorporates external standard reference answers. When the model cannot solve a problem, the reference answer supplies the correct response, steering the model toward an unequivocally accurate update direction. This approach offers two main advantages: (1) it improves training efficiency by fully utilizing every sample; (2) it enables the model to emulate the problem solving strategy of the reference answer during training, thereby enhancing generalization in reasoning. GCPO achieves outstanding results across multiple benchmark datasets, yielding substantial improvements over the baseline model. Our code is available at: https://github.com/AchoWu/GCPO.
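The drawback described above can be illustrated with a small sketch. It assumes the common group-relative advantage form (reward minus group mean, divided by group standard deviation) and binary correctness rewards. The helper names `group_advantages` and `inject_reference`, and the choice to replace the first rollout with the reference answer, are hypothetical illustrations and not taken from the GCPO repository.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def inject_reference(responses, rewards, reference_answer, reference_reward=1.0):
    """If every rollout failed, swap one rollout for the reference answer.

    Assumption: the first rollout is replaced; the actual method may
    choose differently.
    """
    if any(r > 0 for r in rewards):
        return list(responses), list(rewards)
    new_responses = [reference_answer] + list(responses[1:])
    new_rewards = [reference_reward] + list(rewards[1:])
    return new_responses, new_rewards


# Every rollout is wrong, so all rewards are equal.
rollouts = ["wrong A", "wrong B", "wrong C", "wrong D"]
rewards = [0.0, 0.0, 0.0, 0.0]

# Without a reference answer, every advantage is zero: no learning signal.
plain = group_advantages(rewards)

# With the reference answer injected, the group has contrast again.
mixed_responses, mixed_rewards = inject_reference(
    rollouts, rewards, "reference solution"
)
guided = group_advantages(mixed_rewards)

print(plain)
print(guided)
```

In this sketch the all-incorrect group yields zero advantages, while the group containing the reference answer gives that answer a positive advantage and the failed rollouts negative ones, which is the update direction the abstract describes.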
