The Peril of Preference: Why GRPO fails on Ordinal Rewards

Anisha Garg, Ganesh Venkatesh

TL;DR

The paper identifies a fundamental flaw in Group-relative Policy Optimization (GRPO) when training with ordinal rewards: the group-average baseline can yield a positive learning signal for failed trajectories, reinforcing sub-optimal behavior. It introduces Correctness Relative Policy Optimization (CoRPO), an adaptive baseline that enforces a correctness threshold and, once surpassed, transitions to a relative-preference regime to push toward optimal solutions. The authors formalize the two-phase CoRPO mechanism and validate it empirically on a code-verification RL task, showing improved stability and stronger out-of-domain generalization compared to GRPO and static baselines. They further conduct extensive ablations to analyze reward granularity, data filtering, and rollout strategies, illustrating CoRPO’s robustness to homogeneous failures and its capacity to expand the policy’s solution horizon beyond initial tendencies.
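The flaw described above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the ordinal scale (negative rewards for failed trajectories, positive for correct ones) and the function name `grpo_advantages` are assumptions made for this example, and the standard-deviation normalization often used with GRPO is omitted because it does not change the sign of an advantage.

```python
from statistics import mean

def grpo_advantages(rewards):
    """Group-relative advantages: each reward minus the group mean.

    Simplified sketch; std normalization is left out because it only
    rescales the advantages and never flips their sign.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

# Hypothetical ordinal scale (an assumption, not taken from the paper):
# negative reward = failed trajectory, with -1 a "less bad" failure than -2.
rewards = [-2, -2, -1, -2]
advantages = grpo_advantages(rewards)

# Failed trajectories that nevertheless receive a positive advantage.
failed_but_reinforced = [
    (r, a) for r, a in zip(rewards, advantages) if r < 0 and a > 0
]
print(failed_but_reinforced)
```

Every rollout in this group fails, yet the group mean (-1.75) sits below the reward of the least-bad failure (-1), so that trajectory ends up with a positive advantage.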

Abstract

Group-relative Policy Optimization's (GRPO) simplicity makes it highly desirable for adapting LLMs to become experts at specific tasks. But this simplicity also makes it ill-specified as we seek to enhance RL training with richer, non-binary feedback. When using ordinal rewards to give partial credit, GRPO's simplicity starts to hurt, as its group-average baseline often assigns a positive advantage to failed trajectories and reinforces incorrect behavior. We introduce Correctness Relative Policy Optimization (CoRPO), a new formulation that solves this flaw. CoRPO uses an adaptive baseline that enforces a minimum quality threshold, ensuring failed solutions are never positively reinforced. Once the policy consistently meets this threshold, the baseline automatically transitions to a relative preference mode, pushing the model to find optimal solutions rather than just "acceptable" ones. We empirically validate CoRPO on a code verification task, where it demonstrates more stable convergence and better out-of-domain generalization. This work represents a critical step in our broader research program to enable LLMs to learn genuinely new capabilities through reinforcement learning. We achieve this by enabling LLMs to learn from rich, multi-dimensional feedback, progressing from binary to ordinal rewards in this work and onward to denser, per-step supervision.
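One way to picture the two-phase behavior described in the abstract is a baseline that takes the larger of the group mean and a fixed correctness threshold. This is a minimal sketch under stated assumptions, not necessarily the paper's exact formulation: the function name `corpo_style_advantages`, the threshold value of 0.0, and the ordinal scale (negative = failed, 1 = acceptable, 2 = optimal) are all hypothetical choices made for illustration.

```python
from statistics import mean

def corpo_style_advantages(rewards, correctness_threshold=0.0):
    """Advantages against a clamped baseline (illustrative assumption).

    The baseline is max(group mean, correctness threshold), so it can
    never drop below the threshold that separates failed from correct.
    """
    baseline = max(mean(rewards), correctness_threshold)
    return [r - baseline for r in rewards]

# Hypothetical ordinal scale: negative = failed, 1 = acceptable, 2 = optimal.
mostly_failing = [-2, -2, -1, -2]   # group mean -1.75, baseline clamps to 0.0
mostly_correct = [1, 2, 2, 1]       # group mean 1.5, baseline follows the mean

phase1 = corpo_style_advantages(mostly_failing)
phase2 = corpo_style_advantages(mostly_correct)
print(phase1)
print(phase2)
```

In the failing group the clamped baseline keeps every advantage negative, so no failed rollout is reinforced. In the correct group the baseline is the group mean, so merely acceptable solutions (reward 1) are pushed down relative to optimal ones (reward 2), which mirrors the relative-preference mode the abstract describes.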

Paper Structure

This paper contains 41 sections, 13 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Distribution of advantage sign vs. trajectory status for a representative batch using GRPO baseline. The 18% slice for "Failed Trajectory, Positive Advantage" is empirical evidence of the $b < R(y_f) < 0$ flaw.
  • Figure 2: Ratio of positive to negative signals over training steps for the GRPO baseline, the Static Correctness baseline (Static), and the proposed CoRPO. The left panel shows $r_{count}$, the ratio in terms of rollout counts. The right panel plots $r_{loss}$, which takes the advantage magnitude into account.
  • Figure 3: Impact of GRPO vs. CoRPO on policy distribution dynamics. The X-axis ranks correct solutions by their initial likelihood (High → Low). The Y-axis shows the Uplift Ratio (post-training probability / pre-training probability of correct solutions). (Left) GRPO exhibits distribution sharpening, disproportionately reinforcing high-probability trajectories as training progresses (Step 110 → 180). In contrast, CoRPO applies uniform reinforcement independent of starting probability. (Right) When unconstrained by weight decay (WD=0), CoRPO actively upweights lower-ranked (unlikely) trajectories, expanding the model's problem-solving horizon beyond its initial tendencies.