GOPO: Policy Optimization using Ranked Rewards

Kyuseong Choi, Dwaipayan Saha, Woojeong Kim, Anish Agarwal, Raaz Dwivedi

TL;DR

This work tackles the unreliability of reward magnitudes in RLHF when rewards are non-verifiable and noisy. It proposes GOPO, a rank-based policy optimization method that discards reward magnitudes and uses within-prompt rank order, formalized by the rank-based advantage $\hat{A}_{i,t}^{\mathrm{rank}} := 2 - (\rho(i) - 1)\cdot \frac{4}{G - 1}$. Empirically, GOPO delivers higher training and validation rewards, improved LLM-as-judge win rates, and faster convergence than GRPO across TLDR, UltraChat, and IFEval with multiple base sizes. The analysis indicates that rank-based advantages can increase gradient norms for small $G$ but lead to robust, faster optimization and better calibration in practice. Overall, GOPO provides a simple, effective mechanism for RLHF in non-verifiable settings and suggests a shift toward ordinal information in policy updates.
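The rank-based advantage above can be illustrated with a minimal sketch. It assumes that $\rho(i) = 1$ denotes the highest-reward response in the group, so the best response receives $+2$ and the worst $-2$. The function name `rank_advantages` and the tie-breaking rule (original index order) are hypothetical choices made for this illustration and are not taken from the paper.

```python
from typing import List, Sequence


def rank_advantages(rewards: Sequence[float]) -> List[float]:
    """Map a group of G rewards to rank-based advantages in [-2, 2].

    Assumption: rho(i) = 1 for the highest reward. Ties are broken by
    original index, which is a choice made for this sketch only.
    """
    G = len(rewards)
    if G < 2:
        raise ValueError("need at least two responses per prompt")
    # Sort indices by descending reward; sorted() is stable, so ties keep index order.
    order = sorted(range(G), key=lambda i: -rewards[i])
    rho = [0] * G
    for position, i in enumerate(order):
        rho[i] = position + 1  # ranks are 1-indexed
    step = 4.0 / (G - 1)
    # A_i = 2 - (rho(i) - 1) * 4 / (G - 1)
    return [2.0 - (rho[i] - 1) * step for i in range(G)]


advantages = rank_advantages([0.3, 1.7, -0.2, 0.9, 0.5])
print(advantages)  # [-1.0, 2.0, -2.0, 1.0, 0.0]
```

With $G = 5$ the advantages are evenly spaced by $4/(G-1) = 1$ and sum to zero, regardless of how far apart the raw rewards are.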

Abstract

Standard reinforcement learning from human feedback (RLHF) trains a reward model on pairwise preference data and then uses it for policy optimization. However, while reward models are optimized to capture relative preferences, existing policy optimization techniques rely on absolute reward magnitudes during training. In settings where the rewards are non-verifiable, such as summarization, instruction following, and chat completion, this misalignment often leads to suboptimal performance. We introduce Group Ordinal Policy Optimization (GOPO), a policy optimization method that uses only the ranking of the rewards and discards their magnitudes. Compared to Group Relative Policy Optimization (GRPO), our rank-based transformation of rewards provides several gains in settings with non-verifiable rewards: (1) consistently higher training/validation reward trajectories, (2) improved LLM-as-judge evaluations across most intermediate training steps, and (3) reaching a policy of comparable quality in substantially fewer training steps than GRPO. We demonstrate consistent improvements across a range of tasks and model sizes.

Paper Structure (39 sections, 3 theorems, 27 equations, 10 figures, 2 tables)

Key Result

Theorem 3.1

Let $\mathcal{F} = \sigma(q, A_1, \dots, A_G)$ be the conditioning event and define the centered vectors $\xi := g - \mathbb{E}[g\mid \mathcal{F}]$ and $\widetilde{X}_i := X_i - \mathbb{E}[X_i\mid \mathcal{F}]$ where $g = \nabla_\theta \mathcal{J}_1(\theta)$ is the gradient of the (non-penalized) ob

Figures (10)

  • Figure 1: GOPO vs. GRPO advantage transformations. For a fixed prompt with rewards $\{r_i\}$, GRPO uses a $z$-score transformation that centers and scales rewards within the group, while GOPO uses a rank-based transformation that retains only the ordering. $z$-score advantages preserve relative magnitudes among rewards (e.g., similar colors for $A_1, A_4, A_5$ reflect similar raw-reward affinities), whereas rank-based advantages discard scale and can assign different heat levels to rewards with similar magnitudes.
  • Figure 2: Base model: Qwen3-8B, Reward model: Skywork (Qwen3-8B), Task: TLDR. Figures (a) and (b) plot the mean reward of each training step's policy generations on prompts from the training and validation datasets, respectively; both rewards are consistently higher for GOPO-updated policies throughout training. Figure (c) reports the LLM-as-judge win-rate (see Section \ref{sec:eval} for how the win-rate is defined) of GOPO-updated policies against GRPO-updated policies at matched training steps; for multi-seed generations, GOPO consistently improves the win-rates throughout all training steps. The policy generation temperature for Figure (c) is fixed at $0.5$; see Table \ref{tab:qwen8b_combined} in Section \ref{sec:tldrchat} for win-rates at varying temperatures. Lastly, GOPO reaches the validation reward of GRPO's final training step earlier (step $100$), and the win-rate of that earlier GOPO checkpoint against the final GRPO policy is $0.52$.
  • Figure 3: Base model: Qwen3-4B, Reward model: Skywork (Qwen3-8B), Task: UltraChat. Figures (a) and (b) plot the mean reward of each training step's policy generations on prompts from the training and validation datasets, respectively; both rewards are consistently higher for GOPO-updated policies throughout training. Figure (c) reports the LLM-as-judge win-rate (see Section \ref{sec:eval} for how the win-rate is defined) of GOPO-updated policies against GRPO-updated policies at matched training steps; for multi-seed generations, GOPO consistently improves the win-rates throughout most of the training steps. The policy generation temperature for Figure (c) is fixed at $0.5$; see Table \ref{tab:qwen_small_combined} in Appendix \ref{app:robust} for results at varying temperatures. Lastly, GOPO reaches the validation reward of GRPO's final training step earlier (step $175$), and the win-rate of that earlier GOPO checkpoint against the final GRPO policy is $0.52$.
  • Figure 4: Base model: Qwen3-1.7B, Reward model: Skywork (Qwen3-8B), Task: TLDR. Figures (a) and (b) plot the mean reward of each training step's policy generations on prompts from the training and validation datasets, respectively; both rewards are consistently higher for GOPO-updated policies throughout training. Figure (c) reports the LLM-as-judge win-rate (see Section \ref{sec:eval} for how the win-rate is defined) of GOPO-updated policies against GRPO-updated policies at matched training steps; for multi-seed generations, GOPO consistently improves the win-rates throughout all training steps. The policy generation temperature for Figure (c) is fixed at $0.5$; see Table \ref{tab:qwen_small_combined} in Appendix \ref{app:robust} for results at varying temperatures. Lastly, GOPO reaches the validation reward of GRPO's final training step earlier (step $250$), and the win-rate of that earlier GOPO checkpoint against the final GRPO policy is $0.52$.
  • Figure 5: Base model: Qwen3-1.7B, Reward model: Skywork (Qwen3-8B), Task: IFEval. Figures (a) and (b) plot the mean reward of each training step's policy generations on prompts from the training and validation datasets, respectively; both rewards are consistently higher for GOPO-updated policies throughout training. Figure (c) reports the best benchmark score (see Section \ref{sec:eval} for details) of GOPO-updated and GRPO-updated policies across multiple generation temperatures; GOPO achieves higher scores at earlier checkpoints.
  • ...and 5 more figures
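
The contrast drawn in Figure 1 can be sketched in a few lines. The example below is illustrative only: it assumes the GRPO-style $z$-score uses the population standard deviation with a small stabilizing constant `eps`, and that rank 1 is the highest reward. Both function names and the reward values are hypothetical and do not come from the paper.

```python
import math
from typing import List, Sequence


def zscore_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Center and scale rewards within the group (assumed population std)."""
    G = len(rewards)
    mean = sum(rewards) / G
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / G)
    return [(r - mean) / (std + eps) for r in rewards]


def rank_advantages(rewards: Sequence[float]) -> List[float]:
    """Rank-based advantages 2 - (rho(i) - 1) * 4 / (G - 1), rank 1 = best."""
    G = len(rewards)
    order = sorted(range(G), key=lambda i: -rewards[i])
    rho = [0] * G
    for position, i in enumerate(order):
        rho[i] = position + 1
    return [2.0 - (rho[i] - 1) * 4.0 / (G - 1) for i in range(G)]


# Two hypothetical groups with the same ordering but different magnitudes.
group_a = [0.10, 0.12, 0.11, 0.95]  # three near-identical rewards, one outlier
group_b = [0.10, 0.40, 0.25, 0.95]  # same ordering, more evenly spread

z_a, z_b = zscore_advantages(group_a), zscore_advantages(group_b)
rank_a, rank_b = rank_advantages(group_a), rank_advantages(group_b)

print("z-score A:", [round(z, 3) for z in z_a])
print("z-score B:", [round(z, 3) for z in z_b])
print("rank    A:", [round(a, 3) for a in rank_a])
print("rank    B:", [round(a, 3) for a in rank_b])
```

In this sketch the two groups receive identical rank-based advantages because their orderings match, while their $z$-score advantages differ. Within `group_a`, the three near-identical rewards get nearly equal $z$-scores but clearly separated rank-based advantages, which is the behavior the caption of Figure 1 describes.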

Theorems & Definitions (5)

  • Theorem 3.1: Larger Gradient Norms
  • Remark 3.2: Connection to KL, Theorem \ref{thm:B1-gopo-inflation}
  • Lemma 3.1: Empirical Second Moment of Advantages
  • Proof: Proof of Lemma \ref{lemma:helper-advantage-bounds}
  • Theorem 4.1: Bounded Gradient Norms