Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions

Simon Matrenok; Skander Moalla; Caglar Gulcehre

TL;DR

Quantile Reward Policy Optimization (QRPO) is a framework for learning from pointwise absolute rewards in offline or off-policy settings. It transforms rewards into quantile form, which gives the partition function of the KL-regularized RL objective a tractable expression (analytic, or a simple integral) and lets the closed-form solution be fit with a pointwise regression objective, without relying on preferences. The method performs strongly on general chat and coding tasks, scales with the pre-computation budget, and shows less length bias than preference-based baselines. Overall, QRPO extends policy fitting to absolute rewards and offers a principled, scalable path for offline alignment of LLMs.

Abstract

Aligning large language models with pointwise absolute rewards has so far required online, on-policy algorithms such as PPO and GRPO. In contrast, simpler methods that can leverage offline or off-policy data, such as DPO and REBEL, are limited to learning from preference pairs or relative signals. To bridge this gap, we introduce Quantile Reward Policy Optimization (QRPO), which learns from pointwise absolute rewards while preserving the simplicity and offline applicability of DPO-like methods. QRPO uses quantile rewards to enable regression to the closed-form solution of the KL-regularized RL objective. This reward yields an analytically tractable partition function, removing the need for relative signals to cancel this term. Moreover, QRPO scales with increased compute to estimate quantile rewards, opening a new dimension for pre-computation scaling. Empirically, QRPO consistently achieves top performance on chat and coding evaluations (reward model scores, AlpacaEval 2, and LeetCode) compared to DPO, REBEL, and SimPO across diverse datasets and 8B-scale models. Finally, we find that training with robust rewards instead of converting them to preferences induces less length bias.
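The pipeline described in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes the quantile reward of a completion is estimated as the fraction of reference-policy samples with a reward no higher than its own, and that this quantile reward is uniformly distributed on [0, 1] under the reference policy, in which case the partition function is Z = β(exp(1/β) − 1). The function names (`quantile_reward`, `log_partition`, `qrpo_style_loss`) and the scalar log-probability inputs are hypothetical.

```python
import numpy as np


def quantile_reward(reward, ref_rewards):
    """Empirical quantile of `reward` among rewards of reference-policy samples.

    Assumption: `ref_rewards` holds rewards of completions sampled from the
    reference policy for the same prompt (the pre-computation step).
    """
    ref_rewards = np.asarray(ref_rewards, dtype=float)
    return float(np.mean(ref_rewards <= reward))


def log_partition(beta):
    """log Z for a reward that is uniform on [0, 1] under the reference policy.

    Z = integral_0^1 exp(u / beta) du = beta * (exp(1 / beta) - 1),
    written in a form that stays finite for small beta.
    """
    return np.log(beta) + 1.0 / beta + np.log1p(-np.exp(-1.0 / beta))


def qrpo_style_loss(logp_policy, logp_ref, q_reward, beta):
    """Pointwise squared error against the closed-form KL-regularized solution.

    The optimal policy satisfies
        beta * log(pi*(y|x) / pi_ref(y|x)) = R_q(x, y) - beta * log Z,
    so the loss regresses the scaled log-ratio onto that target.
    """
    target = q_reward - beta * log_partition(beta)
    prediction = beta * (logp_policy - logp_ref)
    return (target - prediction) ** 2


# Usage: one prompt, one training completion, four reference samples.
beta = 1.0
q = quantile_reward(0.5, [0.1, 0.4, 0.6, 0.9])
loss = qrpo_style_loss(logp_policy=-12.0, logp_ref=-12.3, q_reward=q, beta=beta)
```

Because the target depends on a single completion, no second completion is needed to cancel Z, which is what allows training on pointwise rewards. Drawing more reference samples per prompt gives a finer estimate of the quantile, which is the pre-computation scaling dimension the abstract refers to.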
