GOPO: Policy Optimization using Ranked Rewards
Kyuseong Choi, Dwaipayan Saha, Woojeong Kim, Anish Agarwal, Raaz Dwivedi
TL;DR
This work tackles the unreliability of reward magnitudes in RLHF when rewards are non-verifiable and noisy. It proposes GOPO, a rank-based policy optimization method that discards reward magnitudes and uses only the within-prompt rank order, formalized by the rank-based advantage $\hat{A}_{i,t}^{\mathrm{rank}} := 2 - (\rho(i) - 1)\cdot \frac{4}{G - 1}$, where $\rho(i)$ is the rank of response $i$ among the $G$ responses sampled for a prompt. Empirically, GOPO delivers higher training and validation rewards, improved LLM-as-judge win rates, and faster convergence than GRPO on TLDR, UltraChat, and IFEval across multiple base model sizes. The analysis indicates that rank-based advantages can increase gradient norms for small $G$, yet in practice they yield more robust, faster optimization and better calibration. Overall, GOPO provides a simple, effective mechanism for RLHF in non-verifiable settings and suggests a shift toward ordinal information in policy updates.
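To make the rank-based advantage concrete, here is a minimal Python sketch of the formula above for a single prompt. It is an illustration under stated assumptions, not the authors' implementation: the function name `rank_advantages` is hypothetical, rank 1 is assumed to go to the highest-reward response, and ties are assumed to be broken by sample index (the summary does not specify either convention). The subscript $t$ in the formula suggests the same value is shared by every token of response $i$, so the sketch returns one value per response.

```python
from typing import List, Sequence


def rank_advantages(rewards: Sequence[float]) -> List[float]:
    """Map the G rewards of one prompt's group to rank-based advantages.

    Implements A_i = 2 - (rho(i) - 1) * 4 / (G - 1).
    Assumption (not stated in the source): rho(i) = 1 for the highest
    reward, and ties are broken by position in the group.
    """
    G = len(rewards)
    if G < 2:
        raise ValueError("need at least two responses per prompt")

    # Indices sorted from highest to lowest reward; sorted() is stable,
    # so tied rewards keep their original relative order.
    order = sorted(range(G), key=lambda i: -rewards[i])

    # rho[i] is the 1-based rank of response i within the group.
    rho = [0] * G
    for rank, i in enumerate(order, start=1):
        rho[i] = rank

    # Evenly spaced advantages from +2 (rank 1) down to -2 (rank G).
    return [2.0 - (rho[i] - 1) * 4.0 / (G - 1) for i in range(G)]


rewards = [0.3, 5.0, -1.2, 0.4, 0.35]
print(rank_advantages(rewards))  # [-1.0, 2.0, -2.0, 1.0, 0.0]
```

Under these assumptions the advantages are always evenly spaced on $[-2, 2]$ and sum to zero, so an outlier reward such as 5.0 has no more influence than any other top-ranked response, and rescaling the rewards by any positive factor leaves the output unchanged.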
Abstract
Standard reinforcement learning from human feedback (RLHF) trains a reward model on pairwise preference data and then uses it for policy optimization. However, while reward models are optimized to capture relative preferences, existing policy optimization techniques rely on absolute reward magnitudes during training. In settings where the rewards are non-verifiable, such as summarization, instruction following, and chat completion, this misalignment often leads to suboptimal performance. We introduce Group Ordinal Policy Optimization (GOPO), a policy optimization method that uses only the ranking of the rewards and discards their magnitudes. Compared to Group Relative Policy Optimization (GRPO), our rank-based transformation of rewards provides several gains in settings with non-verifiable rewards: (1) consistently higher training and validation reward trajectories, (2) improved LLM-as-judge evaluations across most intermediate training steps, and (3) a policy of comparable quality reached in substantially fewer training steps. We demonstrate consistent improvements across a range of tasks and model sizes.
