Why is Your Language Model a Poor Implicit Reward Model?

Noam Razin, Yong Lin, Jiarui Yao, Sanjeev Arora

TL;DR

This work analyzes why implicit reward models (IM-RMs) underperform explicit reward models (EX-RMs) in generalization, despite sharing training data and losses. The authors combine theory and extensive experiments to show that IM-RMs rely more on superficial token-level cues, making them vulnerable to token-level distribution shifts and sometimes in-distribution accuracy drops, while EX-RMs leverage structured hidden representations to generalize. They also demonstrate that the common generation–verification gap hypothesis is insufficient to explain the gap, by proving that verification does not imply generation and by validating this with a Hamiltonian-cycle task. Empirically, IM-RMs exhibit weaker token-level generalization but comparable or stronger performance under domain shifts, and EX-RMs consistently yield higher reward margins, highlighting how small design choices profoundly influence robustness and RLHF outcomes. The findings guide more robust reward-model design and motivate further study of implicit biases across reward-model families.

Abstract

Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Toward a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.
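To make the "differ only in how the reward is computed" point concrete, the following is a minimal sketch of the two reward computations, not the authors' implementation. The EX-RM applies a linear head to a hidden representation, while the IM-RM uses the log-probability ratio $\beta \left( \ln \pi({\mathbf y} | {\mathbf x}) - \ln \pi_{\mathrm{ref}}({\mathbf y} | {\mathbf x}) \right)$. The function names, the toy vectors, and the value of `beta` are assumptions made for illustration only.

```python
import math

def ex_rm_reward(hidden, head_weights):
    """EX-RM sketch: a linear head applied to the hidden representation of (x, y)."""
    return sum(h * w for h, w in zip(hidden, head_weights))

def im_rm_reward(token_logprobs, ref_token_logprobs, beta=0.1):
    """IM-RM sketch: beta * (ln pi(y|x) - ln pi_ref(y|x)).

    The sequence log-probability is the sum of the per-token
    log-probabilities of the response tokens.
    """
    return beta * (sum(token_logprobs) - sum(ref_token_logprobs))

# Toy, made-up inputs (hypothetical values, not taken from the paper).
hidden = [0.5, -1.0, 2.0]        # hidden representation of (x, y)
head = [1.0, 0.5, 0.25]          # weights of the linear reward head
policy_lp = [math.log(0.5), math.log(0.25)]   # per-token ln pi(y_k | x, y_<k)
ref_lp = [math.log(0.25), math.log(0.25)]     # per-token ln pi_ref(y_k | x, y_<k)

ex_reward = ex_rm_reward(hidden, head)
im_reward = im_rm_reward(policy_lp, ref_lp, beta=0.1)
print(ex_reward, im_reward)
```

Note that the IM-RM reward depends on the response only through the probabilities assigned to its specific tokens, which is where the token-level sensitivity discussed in the abstract enters.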

Paper Structure

This paper contains 36 sections, 6 theorems, 72 equations, 10 figures, 10 tables.

Key Result

Theorem 1

Let $r_{\mathrm{IM}}$ be the IM-RM induced by a distribution $\pi$ over token sequences, i.e., $r_{\mathrm{IM}} ({\mathbf x}, {\mathbf y}) = \beta \left( \ln \pi ({\mathbf y} | {\mathbf x}) - \ln \pi_{\mathrm{ref}} ({\mathbf y} | {\mathbf x}) \right)$ for token sequences ${\mathbf x}, {\mathbf y}$ over $\mathcal{V}$ [...] That is, for all prompts, the probability of $\pi$ generating a correct response is greater than th...

Figures (10)

  • Figure 1: Explicit vs. implicit reward models. To compute the reward for a prompt-response pair $({\mathbf x}, {\mathbf y})$, an EX-RM applies a linear head to the hidden representation that the language model $\pi_\theta$ produces for $({\mathbf x}, {\mathbf y})$. In contrast, the reward of an IM-RM is implicitly defined by $\pi_\theta$ through $\beta \ln \frac{ \pi_\theta ({\mathbf y} | {\mathbf x}) }{ \pi_{\mathrm{ref}} ({\mathbf y} | {\mathbf x}) }$, where $\beta \in {\mathbb R}_{> 0}$ is a fixed coefficient and $\pi_{\mathrm{ref}}$ is a reference distribution (cf. Rafailov et al., 2023).
  • Figure 2: IM-RMs are less robust than EX-RMs to token-level distribution shifts, but perform comparably or better under domain shifts. We trained EX-RMs and IM-RMs on UltraFeedback (Cui et al., 2024), using the same initial language models, and evaluated their accuracy in-distribution (UltraFeedback test set), under token-level shifts (three UltraFeedback variants, in which responses were either paraphrased or translated to another language), and under domain shifts (two math datasets and one code dataset). Reported are the win-rates, i.e., the percentage of evaluations in which either the EX-RM or IM-RM achieved a higher accuracy. If the accuracies were within $1\%$ of each other, we considered it a tie. The experiment included three random seeds per configuration and six language models: Gemma-2-2B-IT (Gemma Team, 2024), Qwen-2.5-1.5B-Instruct, Qwen-2.5-3B-Instruct (Qwen Team, 2024), Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, and Llama-3.1-8B-Instruct (Dubey et al., 2024). See the paper's section on real-world experiments for additional details.
  • Figure 3: Learning to verify with IM-RMs does not require learning to generate. We trained EX-RMs and IM-RMs to solve a Hamiltonian cycle verification task, based on the Pythia-1B language model. Each prompt in the dataset describes an undirected graph and the chosen and rejected responses are permutations of vertices. The chosen responses form Hamiltonian cycles in their respective graphs, while the rejected responses do not (see the paper's appendix on the Hamiltonian cycle experiment for further details). In accordance with our theory on generation versus verification, although IM-RMs are unable to generate even a single correct Hamiltonian cycle for graphs in the training or test sets, they accurately distinguish between chosen and rejected responses, slightly outperforming EX-RMs. Values in the table are means across three random seeds (standard deviation was under $0.008$ in all cases).
  • Figure 4: IM-RMs fail to generalize to a simple token-level distribution shift, while EX-RMs generalize perfectly. We trained EX-RMs and IM-RMs on prompts from the Persona dataset (Perez et al., 2022). Chosen responses expressed agreement with the prompts, whereas rejected responses expressed disagreement. During evaluation, we included paraphrased versions of the original responses (figure includes exemplar responses). In line with our analysis (see the paper's analysis section), IM-RMs are extremely inaccurate over paraphrased responses, whereas EX-RMs achieve perfect accuracy. The experiments were based on four language models: Pythia-1B, Qwen-2.5-1.5B-Instruct, Llama-3.2-1B, and Llama-3.2-1B-Instruct. Values in the table are means across the models and three random seeds (standard deviation was below $0.04$ in all cases).
  • Figure 5: IM-RMs are less robust than EX-RMs to token-level distribution shifts, but perform comparably or better under domain shifts. This figure presents the results of an experiment identical to that of Figure 2, except that the reward models were trained on the RewardMATH dataset instead of UltraFeedback. Accordingly, the math subset of RewardBench poses a token-level shift while UltraFeedback variants and the code subset of RewardBench pose a domain shift. Note that, in this setting, EX-RMs and IM-RMs perform similarly in-distribution since both reach near-maximal accuracy (see the paper's table of accuracies and reward margins).
  • ...and 5 more figures
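The Hamiltonian cycle task in Figure 3 is a setting where checking a response is easy even though producing one is hard in general. As a rough illustration, the sketch below checks whether a permutation of vertices forms a Hamiltonian cycle in an undirected graph. The function name, the choice of labelling vertices `0..n-1`, and the edge-list input format are assumptions for this example; the paper's actual prompt and response encoding is described in its appendix and is not reproduced here.

```python
def is_hamiltonian_cycle(num_vertices, edges, order):
    """Check whether `order` is a Hamiltonian cycle of an undirected graph.

    Assumed (hypothetical) format: vertices are labelled 0..num_vertices-1,
    `edges` is an iterable of vertex pairs, `order` is a list of vertices.
    """
    # A cycle in a simple graph needs at least three vertices.
    if num_vertices < 3:
        return False
    # The response must visit every vertex exactly once.
    if sorted(order) != list(range(num_vertices)):
        return False
    # Undirected edges: store each one as an unordered pair.
    edge_set = {frozenset(e) for e in edges}
    # Consecutive vertices, including last -> first, must be adjacent.
    return all(
        frozenset((order[i], order[(i + 1) % num_vertices])) in edge_set
        for i in range(num_vertices)
    )

# Toy example: a 4-cycle graph 0-1-2-3-0.
square_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
chosen = [0, 1, 2, 3]     # follows the edges, so it is a Hamiltonian cycle
rejected = [0, 2, 1, 3]   # 0-2 is not an edge, so it is not
print(is_hamiltonian_cycle(4, square_edges, chosen))
print(is_hamiltonian_cycle(4, square_edges, rejected))
```

The check runs in time linear in the size of the graph, whereas no efficient algorithm is known for finding a Hamiltonian cycle in a general graph, which is what makes the task a natural test of whether verification ability requires generation ability.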

Theorems & Definitions (14)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Proof sketch of Theorem 1 (full proof in the paper's appendix)
  • Definition 3
  • Corollary 1
  • Theorem 2
  • Proof sketch of Theorem 2 (full proof in the paper's appendix)
  • Lemma 1
  • proof
  • ...and 4 more