Discriminative Policy Optimization for Token-Level Reward Models

Hongzhan Chen, Tao Yang, Shiping Gao, Ruijun Chen, Xiaojun Quan, Hongtao Tian, Ting Yao

TL;DR

This work introduces the Q-function Reward Model (Q-RM), a token-level, discriminative-policy–based reward model that decouples reward estimation from language generation. By linking the optimal discriminative logits $Z^*(s_t,a_t)$ to the optimal Q-function via $\beta\log\phi^*(s_t,a_t)=Q^*(s_t,a_t)-V^*(s_t)$, the method derives token-level rewards and an efficient training objective suitable for PPO and REINFORCE. The authors prove a linear relationship between $Q^*$ and $Z^*$, reformulate the reward signals to be computation-friendly, and demonstrate substantial improvements over ORM and token-level PRMs across mathematical reasoning, machine reading comprehension, and instruction-following tasks, with dramatically faster convergence. Overall, Q-RM provides finer-grained credit assignment, reduces instability in token-level RLHF, and delivers practical, scalable gains in alignment of LLMs.
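The identity $\beta\log\phi^*(s_t,a_t)=Q^*(s_t,a_t)-V^*(s_t)$ can be illustrated numerically. The sketch below is not the authors' implementation. It assumes that $\phi^*$ is obtained by a softmax over the logits at each state, and the function names, the value of `beta`, and the toy logits are all hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one state's action logits."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]

def token_advantages(logits, beta=1.0):
    """Return beta * log phi(s_t, a) for every action a.

    Under the identity quoted in the text, this quantity plays the role of
    Q*(s_t, a) - V*(s_t). ASSUMPTION: phi is the softmax of the logits.
    """
    return [beta * lp for lp in log_softmax(logits)]

# Hypothetical logits Z(s_t, a) for a three-action toy vocabulary.
toy_logits = [2.0, 0.5, -1.0]
toy_beta = 0.5
adv = token_advantages(toy_logits, beta=toy_beta)
```

Under the softmax assumption every value is non-positive, and the gap between two actions equals `beta` times the gap between their logits, so the ranking of tokens by logit is preserved.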

Abstract

Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs) for optimizing policy models, positioning them as a promising approach to enhancing the capabilities of LLMs in complex reasoning tasks. Recent efforts have advanced PRMs from step-level to token-level granularity by integrating reward modeling into the training of generative models, with reward scores derived from token generation probabilities. However, the conflict between generative language modeling and reward modeling may introduce instability and lead to inaccurate credit assignments. To address this challenge, we revisit token-level reward assignment by decoupling reward modeling from language generation and derive a token-level reward model through the optimization of a discriminative policy, termed the Q-function Reward Model (Q-RM). We theoretically demonstrate that Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. In our experiments, Q-RM consistently outperforms all baseline methods across various benchmarks. For example, when integrated into PPO/REINFORCE algorithms, Q-RM enhances the average Pass@1 score by 5.85/4.70 points on mathematical reasoning tasks compared to the ORM baseline, and by 4.56/5.73 points compared to the token-level PRM counterpart. Moreover, reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH. Code and data are available at https://github.com/homzer/Q-RM.

Paper Structure

This paper contains 42 sections, 2 theorems, 39 equations, 16 figures, 4 tables, 2 algorithms.

Key Result

Proposition 3.2

Given a trajectory $\tau$ of length $T$, let $\mathcal{V}(\tau)=\frac{1}{T}\sum^{T-1}_{t=0}\log\sum_{a_t\in\mathcal{A}}\exp(Z^*(s_t,a_t)-z_t)$ be the average logarithm of the adjusted partition function over $\tau$, where $Z^*(s_t,a_t)$ is the logit of the discriminative policy $\phi^*$, and $z_t=\max\dots$ ... where $\mathcal{H}^*(\tau)=-\frac{1}{T}\sum^{T-1}_{t=0}\sum_{a_t\in\mathcal{A}}\phi^*(s_t,a_t)\log\dots$
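The quantity $\mathcal{V}(\tau)$ defined above is an average of shifted log-sum-exp terms, one per step of the trajectory. A minimal sketch of that computation follows. Because the definition of $z_t$ is cut off in the text, the offsets are supplied by the caller rather than computed. The function name `avg_log_adjusted_partition` and the toy values are hypothetical and are not taken from the paper's code.

```python
import math

def avg_log_adjusted_partition(logits_per_step, offsets):
    """Compute V(tau) = (1/T) * sum_t log sum_a exp(Z(s_t, a) - z_t).

    logits_per_step: list of length T; entry t holds the logits Z(s_t, a)
                     for every action a at step t.
    offsets:         list of length T holding z_t (supplied by the caller;
                     its definition is not reproduced here).
    """
    if not logits_per_step:
        raise ValueError("trajectory must contain at least one step")
    if len(logits_per_step) != len(offsets):
        raise ValueError("need exactly one offset z_t per step")
    total = 0.0
    for logits, z in zip(logits_per_step, offsets):
        total += math.log(sum(math.exp(x - z) for x in logits))
    return total / len(logits_per_step)

# Toy trajectory with T = 2: two actions at step 0, three at step 1.
toy_logits = [[0.0, 0.0], [1.0, 1.0, 1.0]]
toy_offsets = [0.0, 1.0]
v = avg_log_adjusted_partition(toy_logits, toy_offsets)
```

In this toy case each shifted logit is zero, so the per-step terms are $\log 2$ and $\log 3$ and `v` is their mean. Adding a constant to every offset lowers the result by the same constant, which is why subtracting an offset can keep the exponentials in a numerically safe range without changing the underlying information.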

Figures (16)

  • Figure 1: Token credit assignment visualization of an example from GSM8K. We compare our method Q-RM with DPO-RM (Rafailov et al., 2024) on trajectories $\tau^l$ and $\tau^w$. DPO-RM tends to assign large rewards to line breaks in both chosen and rejected answers, while overlooking critical tokens (e.g., "$135", "$7", and "$133"). Q-RM, on the other hand, assigns high rewards to correct tokens (e.g., "0.05") while penalizing incorrect tokens (e.g., "$135"). These rewards are standardized to ensure a mean of 0 and a variance of 1.
  • Figure 2: (a) Comparison of alignment results across different model sizes. (b) Accuracy comparison of various RMs and the performance of their corresponding policies. (c) Pass@N performance across increasing sampling iterations $N$.
  • Figure 3: (a) Comparison of training efficiency between Q-RM and ORM by evaluating policy accuracy. (b) Comparison of Q-RM and step-level PRM trained on the PRM800K dataset.
  • Figure 4: Experiments on matching the sizes of the reward model and policy model. We report Pass@1 performance on GSM8K and MATH test sets. (a) Llama-3-8B-Instruct serves as the backbone. (b) Llama-3.1-8B-Instruct serves as the backbone.
  • Figure 5: Token reward distributions. We use GSM8K test set instructions to obtain policy responses, with Q-RM scoring each token. "Base" is the Llama-2-7B model, "SFT" is the Base model fine-tuned on task data, and "PPO+Q-RM" is the SFT model further trained with PPO using Q-RM. (a) Reward distributions of Base and SFT. (b) Reward distributions of SFT and PPO+Q-RM.
  • ...and 11 more figures

Theorems & Definitions (5)

  • Definition 3.1
  • Proposition 3.2
  • proof
  • Proposition 3.4
  • proof