The Peril of Preference: Why GRPO fails on Ordinal Rewards
Anisha Garg, Ganesh Venkatesh
TL;DR
The paper identifies a fundamental flaw in Group-relative Policy Optimization (GRPO) when training with ordinal rewards: the group-average baseline can yield a positive learning signal for failed trajectories, reinforcing suboptimal behavior. It introduces Correctness Relative Policy Optimization (CoRPO), whose adaptive baseline enforces a correctness threshold and, once the policy surpasses that threshold, transitions to a relative-preference regime that pushes toward optimal solutions. The authors formalize the two-phase CoRPO mechanism and validate it empirically on a code-verification RL task, showing improved stability and stronger out-of-domain generalization compared to GRPO and static baselines. They further conduct extensive ablations on reward granularity, data filtering, and rollout strategies, illustrating CoRPO’s robustness to homogeneous failures and its capacity to expand the policy’s solution horizon beyond initial tendencies.
Abstract
The simplicity of Group-relative Policy Optimization (GRPO) makes it highly desirable for adapting LLMs to become experts at specific tasks. But this simplicity also leaves it ill-specified as we seek to enhance RL training with richer, non-binary feedback. When ordinal rewards are used to give partial credit, GRPO's group-average baseline often assigns a positive advantage to failed trajectories and thereby reinforces incorrect behavior. We introduce Correctness Relative Policy Optimization (CoRPO), a new formulation that addresses this flaw. CoRPO uses an adaptive baseline that enforces a minimum quality threshold, ensuring failed solutions are never positively reinforced. Once the policy consistently meets this threshold, the baseline automatically transitions to a relative preference mode, pushing the model to find optimal solutions rather than just "acceptable" ones. We empirically validate CoRPO on a code verification task, where it demonstrates more stable convergence and better out-of-domain generalization. This work represents a critical step in our broader research program to enable LLMs to learn genuinely new capabilities through reinforcement learning. We pursue this by enabling LLMs to learn from rich, multi-dimensional feedback: progressing from binary to ordinal rewards in this work, and onward to denser, per-step supervision.
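To make the baseline issue concrete, the following is a minimal numeric sketch, not the authors' implementation. It assumes a hypothetical ordinal reward scale of 0 to 4 on which a score of 3 or higher counts as correct, and it models the adaptive baseline as the larger of the group mean and that correctness threshold. The clamping rule, the threshold value, and the function names are illustrative assumptions chosen to match the behavior described above. The standard-deviation normalization commonly used with GRPO is omitted because it does not change the sign of an advantage.

```python
from statistics import mean

# Hypothetical ordinal scale 0..4; scores >= 3 are treated as "correct".
# This threshold is an assumption for illustration only.
MIN_CORRECT = 3


def group_mean_advantages(rewards):
    """GRPO-style advantage without std normalization: A_i = r_i - mean(r)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def clamped_baseline_advantages(rewards, min_correct=MIN_CORRECT):
    """Assumed CoRPO-style advantage: baseline = max(mean(r), min_correct).

    Any reward below min_correct is also below the baseline, so a failed
    trajectory can never receive a positive advantage.
    """
    baseline = max(mean(rewards), min_correct)
    return [r - baseline for r in rewards]


# Group 1: every rollout fails (all rewards below the threshold).
failed_group = [0, 0, 1, 2]
grpo_failed = group_mean_advantages(failed_group)
clamped_failed = clamped_baseline_advantages(failed_group)

# Group 2: every rollout is at least acceptable, so the group mean
# exceeds the threshold and the baseline becomes purely relative.
passing_group = [3, 3, 4, 4]
clamped_passing = clamped_baseline_advantages(passing_group)

print("group-mean baseline, failed group:", grpo_failed)
print("clamped baseline, failed group:   ", clamped_failed)
print("clamped baseline, passing group:  ", clamped_passing)
```

In the all-failed group, the group-mean baseline gives the two least-bad failures a positive advantage, whereas the clamped baseline keeps every advantage negative. In the passing group the mean exceeds the threshold, so merely acceptable rollouts are pushed down relative to the best ones.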
