Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions

Simon Matrenok, Skander Moalla, Caglar Gulcehre

TL;DR

QRPO (Quantile Reward Policy Optimization) is a framework for learning from pointwise absolute rewards in offline or off-policy settings. It transforms rewards into quantiles, which makes the partition function tractable. By deriving a regression objective and showing that the partition function can be expressed analytically or via a simple integral, QRPO fits the solution of the KL-regularized RL objective without relying on preferences. The method performs strongly on general chat and coding tasks, scales with the pre-computation budget, and shows less length bias than preference-based baselines. Overall, QRPO extends policy fitting to absolute rewards and offers a principled, scalable path for offline alignment of LLMs.

Abstract

Aligning large language models with pointwise absolute rewards has so far required online, on-policy algorithms such as PPO and GRPO. In contrast, simpler methods that can leverage offline or off-policy data, such as DPO and REBEL, are limited to learning from preference pairs or relative signals. To bridge this gap, we introduce Quantile Reward Policy Optimization (QRPO), which learns from pointwise absolute rewards while preserving the simplicity and offline applicability of DPO-like methods. QRPO uses quantile rewards to enable regression to the closed-form solution of the KL-regularized RL objective. This reward yields an analytically tractable partition function, removing the need for relative signals to cancel this term. Moreover, QRPO scales with increased compute to estimate quantile rewards, opening a new dimension for pre-computation scaling. Empirically, QRPO consistently achieves top performance on chat and coding evaluations (reward model scores, AlpacaEval 2, and LeetCode) compared to DPO, REBEL, and SimPO across diverse datasets and 8B-scale models. Finally, we find that training with robust rewards instead of converting them to preferences induces less length bias.
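The claim that a quantile reward yields a tractable partition function can be illustrated with a short worked derivation. This is a sketch under stated assumptions rather than the paper's own statement: it assumes the standard KL-regularized objective with strength $\beta$, a quantile reward $\mathcal{R}_q$ defined as the reference-policy CDF of the raw reward $\mathcal{R}$, and a continuous reward distribution so that $\mathcal{R}_q$ is exactly uniform on $[0,1]$ under $\pi_{ref}$. The notation follows the figure captions.

```latex
% Assumed setup: raw reward \mathcal{R}, reference policy \pi_{ref}, KL strength \beta.
\begin{align*}
% Closed-form solution of the KL-regularized objective and its partition function
\pi^*(y \mid x) &= \frac{1}{Z(x)}\,\pi_{ref}(y \mid x)\,
                   \exp\!\Big(\tfrac{1}{\beta}\,\mathcal{R}(x,y)\Big), \qquad
Z(x) = \mathbb{E}_{y \sim \pi_{ref}(\cdot \mid x)}
       \Big[\exp\!\Big(\tfrac{1}{\beta}\,\mathcal{R}(x,y)\Big)\Big] \\[4pt]
% Quantile reward: the CDF of the raw reward under the reference policy
\mathcal{R}_q(x,y) &= \Pr_{y' \sim \pi_{ref}(\cdot \mid x)}
                      \big[\mathcal{R}(x,y') \le \mathcal{R}(x,y)\big]
                      \;\sim\; \mathrm{Uniform}(0,1)
                      \quad \text{when } y \sim \pi_{ref}(\cdot \mid x) \\[4pt]
% The partition function no longer depends on the prompt or the reward model
Z_q &= \int_0^1 e^{u/\beta}\,du \;=\; \beta\big(e^{1/\beta} - 1\big) \\[4pt]
% Taking logs of the closed-form solution gives a pointwise regression target
\beta \log \frac{\pi^*(y \mid x)}{\pi_{ref}(y \mid x)}
    &= \mathcal{R}_q(x,y) - \beta \log Z_q
\end{align*}
```

Because $Z_q$ is a known constant for a given $\beta$, the last identity can be fitted from a single sample and its reward, which is why no preference pair is needed to cancel the partition function.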

Paper Structure

This paper contains 91 sections, 2 theorems, 100 equations, 15 figures, 16 tables, 1 algorithm.

Key Result

Theorem 1

Consider the Laplace integral [...]. Let $I=[a,b]$ be a finite interval and assume that [...]. Then, as $\lambda\to\infty$, [...] with coefficients [...]. Moreover, this expansion may be differentiated with respect to $\lambda$ arbitrarily many times.

Figures (15)

  • Figure 1: QRPO outperforms policy fitting methods that learn from relative signals in downstream general chat tasks.
  • Figure 2: QRPO uses quantile rewards, which makes the exact expression of the partition function $Z$ tractable and allows fitting the solution of the KL-regularized objective with a pointwise regression, i.e., using a single sample with its reward instead of a pair of preferences. In the pre-computation phase, we generate reference completions from the reference model and compute their rewards. For training, we then use these reference rewards to compute the quantile reward of training samples. In its simplest form, QRPO optimizes the quantile reward; however, a family of transformations can be applied on top of the quantile reward to recover a desired reward shape, while still having a tractable exact expression for $Z$ (see Table \ref{tab:transforms}).
  • Figure 3: Scaling Performance of QRPO with a varying number of reference rewards generated during the pre-computation phase to estimate quantile rewards, in off-policy and offline distribution shifts with different values of KL regularization ($\beta$) for Llama on Magpie-Air with ArmoRM. We report the average online test reward with error bars indicating the standard deviation over three seeds. In the off-policy case (where training samples closely match the reference model's distribution, here generated by the reference model itself), performance steadily scales with more reference rewards, and is cost-effective for higher regularization ($\beta=0.1$), typically used in large post-training pipelines. In contrast, in the offline case (high distribution shift due to training samples from a more performant model in this case) QRPO achieves near-optimal performance with very few reference rewards and has minimal gains from additional pre-computation, which may be more cost-effective for lower regularization ($\beta=0.003$). This shows QRPO's pre-computation scalability and effectiveness in both scenarios.
  • Figure 4: Length bias. Completion length difference from the initial checkpoint vs. the implicit reward of test completions generated by the best Llama model trained on Magpie-Air for each algorithm. Implicit rewards (colloquially, DPO rewards) are induced by the policy, which is optimal for these rewards according to the RL fine-tuning objective. We report a linear fit with a marker indicating the average completion length and Spearman rank correlation. SimPO reduces the average completion length compared to DPO (cloud shifted to the left), but its policy still exhibits as much length bias as the DPO policy. In contrast, QRPO and REBEL, which use the reward signal, do not exhibit a length bias trend.
  • Figure 5: Distribution of the initial reward $\mathcal{R}$ for a given prompt under the reference ($\pi_{ref}$) and optimal ($\pi^*$) policy distributions for different values of $\beta$. The optimal policy maximizes the RL fine-tuning objective (Equation \ref{eq:objective}) with the quantile reward $\mathcal{R}_q$. In this plot, the reference reward distribution is assumed to be Gaussian, which can serve as a good example for rewards obtained from a reward model in a general chat task. With this assumption, we can compute both the reference and optimal reward distributions analytically. Refer to the final paragraph for the derivation. Left: Gradient update direction for samples with different reward values. The quantile reward $\mathcal{R}_q = \beta \log Z_q$ corresponds to the reward $\mathcal{R}$ at the intersection point of the densities. This value plays the role of a threshold, which separates the samples with rewards below the threshold that should have their probability decreased, and samples with rewards above the threshold that should have their probability increased. Right: Position of the optimal policy reward distribution for different values of $\beta$. A smaller $\beta$ leads to a larger target distribution shift, resulting in a gradient that decreases the probability of the majority of samples around the reference policy (see left plot for the intuition).
  • ...and 10 more figures
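
The two phases described in the Figure 2 caption (pre-computing reference rewards, then regressing on the quantile reward) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the quantile is estimated with a plain empirical CDF, the partition function is taken as $Z_q = \beta(e^{1/\beta} - 1)$ (which holds when the quantile reward is uniform on $[0,1]$ under the reference policy), and the loss is assumed to be a squared error between $\mathcal{R}_q - \beta \log Z_q$ and the implicit reward $\beta \log(\pi_\theta / \pi_{ref})$.

```python
import numpy as np


def quantile_reward(reward, ref_rewards):
    """Empirical quantile of `reward` among reference rewards for the same prompt.

    `ref_rewards` are rewards of completions sampled from the reference model
    during pre-computation (hypothetical input format: a 1-D sequence).
    """
    ref_rewards = np.asarray(ref_rewards, dtype=float)
    return float(np.mean(ref_rewards <= reward))


def log_partition(beta):
    """log Z_q, assuming the quantile reward is Uniform(0, 1) under pi_ref.

    Z_q = integral_0^1 exp(u / beta) du = beta * (exp(1 / beta) - 1).
    Rewritten as log(beta) + 1/beta + log(1 - exp(-1/beta)) so that
    exp(1 / beta) is never evaluated directly for small beta.
    """
    return np.log(beta) + 1.0 / beta + np.log1p(-np.exp(-1.0 / beta))


def qrpo_loss(logp_policy, logp_ref, reward, ref_rewards, beta):
    """Assumed pointwise squared-error loss for a single (prompt, completion)."""
    target = quantile_reward(reward, ref_rewards) - beta * log_partition(beta)
    implicit_reward = beta * (logp_policy - logp_ref)
    return (target - implicit_reward) ** 2


# Usage with made-up numbers: four reference rewards for one prompt.
ref = [0.1, 0.2, 0.5, 0.9]
beta = 0.1
threshold = beta * log_partition(beta)  # quantile at which the target changes sign
loss = qrpo_loss(logp_policy=-12.0, logp_ref=-12.5, reward=0.5,
                 ref_rewards=ref, beta=beta)
```

Under these assumptions, `threshold` is the value $\beta \log Z_q$ discussed in the Figure 5 caption: completions whose quantile reward falls below it have a negative regression target, so their log-probability is pushed below the reference model's, and completions above it are pushed up.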

Theorems & Definitions (2)

  • Theorem : Laplace’s method for an endpoint maximum
  • Theorem : Laplace’s method for an endpoint maximum