GOPO: Policy Optimization using Ranked Rewards

Kyuseong Choi; Dwaipayan Saha; Woojeong Kim; Anish Agarwal; Raaz Dwivedi

TL;DR

This work tackles the unreliability of reward magnitudes in RLHF when rewards are non-verifiable and noisy. It proposes GOPO, a rank-based policy optimization method that discards reward magnitudes and uses only the within-prompt rank order, formalized by the rank-based advantage $\hat{A}_{i,t}^{\mathrm{rank}} := 2 - (\rho(i) - 1)\cdot \frac{4}{G - 1}$, where $\rho(i)$ is the rank of response $i$ among the $G$ responses sampled for a prompt. Empirically, GOPO delivers higher training and validation rewards, improved LLM-as-judge win rates, and faster convergence than GRPO on TLDR, UltraChat, and IFEval across multiple base model sizes. The analysis indicates that rank-based advantages can increase gradient norms for small $G$, yet in practice they lead to robust, faster optimization and better calibration. Overall, GOPO provides a simple, effective mechanism for RLHF in non-verifiable settings and suggests a shift toward ordinal information in policy updates.
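The rank-based advantage above can be illustrated with a small sketch. It is not the authors' implementation: the function name `rank_advantages` is hypothetical, rank 1 is assumed to go to the highest reward, and ties are broken by position in the group, a detail the summary does not specify.

```python
def rank_advantages(rewards):
    """Map the rewards of one prompt's G responses to rank-based advantages.

    Implements A_i = 2 - (rho(i) - 1) * 4 / (G - 1), so the advantages are
    evenly spaced in [-2, 2]. Assumption: rho(i) = 1 for the highest reward.
    """
    G = len(rewards)
    if G < 2:
        raise ValueError("need at least two responses per prompt")
    # Indices sorted from highest to lowest reward; sorted() is stable,
    # so tied rewards keep their original order (an illustrative choice).
    order = sorted(range(G), key=lambda i: rewards[i], reverse=True)
    advantages = [0.0] * G
    for rank, i in enumerate(order, start=1):
        advantages[i] = 2.0 - (rank - 1) * 4.0 / (G - 1)
    return advantages


rewards = [0.3, 1.7, -0.2, 0.9, 0.31]
print(rank_advantages(rewards))  # → [-1.0, 2.0, -2.0, 1.0, 0.0]
```

Note that the rewards 0.3 and 0.31 are nearly equal yet receive advantages a full step apart, and that rescaling every reward by a positive constant leaves the output unchanged, since only the ordering enters the formula.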

Abstract

Standard reinforcement learning from human feedback (RLHF) trains a reward model on pairwise preference data and then uses it for policy optimization. However, while reward models are optimized to capture relative preferences, existing policy optimization techniques rely on absolute reward magnitudes during training. In settings where the rewards are non-verifiable, such as summarization, instruction following, and chat completion, this misalignment often leads to suboptimal performance. We introduce Group Ordinal Policy Optimization (GOPO), a policy optimization method that uses only the ranking of the rewards and discards their magnitudes. Compared to Group Relative Policy Optimization (GRPO), our rank-based transformation of rewards provides several gains in settings with non-verifiable rewards: (1) consistently higher training and validation reward trajectories, (2) improved LLM-as-judge evaluations across most intermediate training steps, and (3) reaching a policy of comparable quality in substantially fewer training steps. We demonstrate consistent improvements across a range of tasks and model sizes.

Paper Structure (39 sections, 3 theorems, 27 equations, 10 figures, 2 tables)

Introduction
Related Work
RLHF and Policy Optimization for Language Models
Preference Learning and Ordinal Information
Variance Reduction, Robustness, and Advantage Design
Multi-Stage Post-Training
Method
Review of GRPO
Group Ordinal Policy Optimization (GOPO)
Why rank?
Gradient norms
Experimental Setup
Training
Models
KL-adjusted training steps
...and 24 more sections

Key Result

Theorem 3.1

Let $\mathcal{F} = \sigma(q, A_1, \dots, A_G)$ be the conditioning event and define the centered vectors $\xi := g - \mathbb{E}[g\mid \mathcal{F}]$ and $\widetilde{X}_i := X_i - \mathbb{E}[X_i\mid \mathcal{F}]$ where $g = \nabla_\theta \mathcal{J}_1(\theta)$ is the gradient of the (non-penalized) ob

Figures (10)

Figure 1: GOPO vs. GRPO advantage transformations. For a fixed prompt with rewards $\{r_i\}$, GRPO uses a $z$-score transformation that centers and scales rewards within the group, while GOPO uses a rank-based transformation that retains only the ordering. $z$-score advantages preserve relative magnitudes among rewards (e.g., similar colors for $A_1, A_4, A_5$ reflect similar raw-reward affinities), whereas rank-based advantages discard scale and can assign different heat levels to rewards with similar magnitudes.
Figure 2: Base model: Qwen3-8B, Reward model: Skywork (Qwen3-8B), Task: TLDR. Panels (a) and (b) plot the mean reward of the policy's generations at each training step, using prompts from the training and validation datasets respectively; both rewards are consistently higher for GOPO-updated policies throughout training. Panel (c) reports the LLM-as-judge win rate (see Section \ref{sec:eval} for its definition) of GOPO-updated policies against GRPO-updated policies at matched training steps; for multi-seed generations, GOPO consistently improves the win rate at all training steps. The policy generation temperature for panel (c) is fixed at $0.5$; see Table \ref{tab:qwen8b_combined} in Section \ref{sec:tldrchat} for win rates at other temperatures. Lastly, GOPO reaches the validation reward of GRPO's final training step earlier (at step $100$), and the win rate of that earlier GOPO checkpoint against the final GRPO policy is $0.52$.
Figure 3: Base model: Qwen3-4B, Reward model: Skywork (Qwen3-8B), Task: UltraChat. Panels (a) and (b) plot the mean reward of the policy's generations at each training step, using prompts from the training and validation datasets respectively; both rewards are consistently higher for GOPO-updated policies throughout training. Panel (c) reports the LLM-as-judge win rate (see Section \ref{sec:eval} for its definition) of GOPO-updated policies against GRPO-updated policies at matched training steps; for multi-seed generations, GOPO consistently improves the win rate at most training steps. The policy generation temperature for panel (c) is fixed at $0.5$; see Table \ref{tab:qwen_small_combined} in Appendix \ref{app:robust} for results at other temperatures. Lastly, GOPO reaches the validation reward of GRPO's final training step earlier (at step $175$), and the win rate of that earlier GOPO checkpoint against the final GRPO policy is $0.52$.
Figure 4: Base model: Qwen3-1.7B, Reward model: Skywork (Qwen3-8B), Task: TLDR. Panels (a) and (b) plot the mean reward of the policy's generations at each training step, using prompts from the training and validation datasets respectively; both rewards are consistently higher for GOPO-updated policies throughout training. Panel (c) reports the LLM-as-judge win rate (see Section \ref{sec:eval} for its definition) of GOPO-updated policies against GRPO-updated policies at matched training steps; for multi-seed generations, GOPO consistently improves the win rate at all training steps. The policy generation temperature for panel (c) is fixed at $0.5$; see Table \ref{tab:qwen_small_combined} in Appendix \ref{app:robust} for results at other temperatures. Lastly, GOPO reaches the validation reward of GRPO's final training step earlier (at step $250$), and the win rate of that earlier GOPO checkpoint against the final GRPO policy is $0.52$.
Figure 5: Base model: Qwen3-1.7B, Reward model: Skywork (Qwen3-8B), Task: IFEval. Panels (a) and (b) plot the mean reward of the policy's generations at each training step, using prompts from the training and validation datasets respectively; both rewards are consistently higher for GOPO-updated policies throughout training. Panel (c) reports the best benchmark score (see Section \ref{sec:eval} for details) of GOPO-updated and GRPO-updated policies across multiple generation temperatures; GOPO achieves higher scores at earlier checkpoints.
...and 5 more figures
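
The contrast described in the Figure 1 caption can be made concrete with a small sketch comparing the two transformations on a group whose ordering is fixed but whose top reward is inflated. This is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, the $z$-score uses the population standard deviation with a small `eps` for numerical safety, and rank 1 is assumed to be the highest reward.

```python
from statistics import mean, pstdev


def zscore_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: center and scale rewards within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def rank_advantages(rewards):
    """GOPO-style advantages: 2 - (rank - 1) * 4 / (G - 1), rank 1 = best."""
    G = len(rewards)
    order = sorted(range(G), key=lambda i: rewards[i], reverse=True)
    advantages = [0.0] * G
    for rank, i in enumerate(order, start=1):
        advantages[i] = 2.0 - (rank - 1) * 4.0 / (G - 1)
    return advantages


clean = [0.1, 0.2, 0.3, 0.4]
noisy = [0.1, 0.2, 0.3, 4.0]  # same ordering, one inflated reward

for name, group in [("clean", clean), ("noisy", noisy)]:
    print(name, "z-score:", [round(a, 2) for a in zscore_advantages(group)])
    print(name, "rank:   ", [round(a, 2) for a in rank_advantages(group)])
```

In this sketch the inflated reward compresses the $z$-score advantages of the three other responses toward a common value, while the rank-based advantages are identical for both groups because only the ordering is used.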

Theorems & Definitions (5)

Theorem 3.1: Larger Gradient Norms
Remark 3.2: Connection to KL, Theorem \ref{thm:B1-gopo-inflation}
Lemma 3.1: Empirical Second Moment of Advantages
Proof: Proof of Lemma \ref{lemma:helper-advantage-bounds}
Theorem 4.1: Bounded Gradient Norms
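
As a worked illustration of why rank-based advantages can be larger in magnitude when $G$ is small, the empirical second moment of the rank-based advantages can be computed directly from the formula $\hat{A}^{\mathrm{rank}} = 2 - (\rho - 1)\cdot\frac{4}{G-1}$. This is a direct calculation under the assumption of no ties (so the ranks are exactly $1, \dots, G$); it is not a restatement of the paper's Lemma 3.1, whose statement is not reproduced here. Writing $j = \rho - 1 \in \{0, \dots, G-1\}$:

$$
\frac{1}{G}\sum_{j=0}^{G-1}\left(2 - \frac{4j}{G-1}\right)^2
= \frac{16}{(G-1)^2}\cdot\frac{1}{G}\sum_{j=0}^{G-1}\left(j - \frac{G-1}{2}\right)^2
= \frac{16}{(G-1)^2}\cdot\frac{G^2-1}{12}
= \frac{4(G+1)}{3(G-1)}.
$$

For $G = 2$ this equals $4$ (advantages $\pm 2$), for $G = 5$ it equals $2$, and it decreases toward $4/3$ as $G$ grows. By comparison, $z$-score advantages computed with the population standard deviation have empirical second moment exactly $1$ whenever the rewards are not all equal, so the rank-based advantages are larger in this sense for every $G$, and most of all for small $G$.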
