When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models

Tong Xie, Andrew Bai, Yuanhao Ban, Yunqi Hong, Haoyu Li, Cho-jui Hsieh

TL;DR

The Bradley-Terry (BT) loss, though standard for reward modeling in RLHF, exhibits a gradient bias: update strength depends not only on prediction error but also on the representation distance between the two responses in a pair. This distorts learning, especially for small-distance pairs that require fine-grained distinctions. The authors propose NormBT, a lightweight per-pair normalization that uses a representation-distance proxy with exponential moving average (EMA) stabilization to align updates with prediction error. Across two backbones and two datasets, NormBT consistently improves RewardBench performance, with notable gains in Reasoning; Best-of-N (BoN) evaluations also show higher gold scores. The work offers a simple, effective fix to a fundamental issue in BT-based reward modeling, supporting better alignment of LMs with human preferences.
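
To make the idea of per-pair normalization with EMA stabilization concrete, here is a minimal NumPy sketch of one possible reading of it. It is not the authors' formulation: the class name `EmaDistanceNormalizer`, the `momentum` and `eps` parameters, the use of the raw distance between final-layer representations as the proxy, and the weight form `ema / distance` are all assumptions made for illustration.

```python
import numpy as np


class EmaDistanceNormalizer:
    """Hypothetical helper: per-pair weights from an EMA of pair distances."""

    def __init__(self, momentum=0.9, eps=1e-8):
        self.momentum = momentum  # assumed smoothing factor
        self.eps = eps            # guards against division by zero
        self.ema = None           # running estimate of the typical pair distance

    def weights(self, h_chosen, h_rejected):
        # Distance between the chosen and rejected representation of each pair.
        dist = np.linalg.norm(h_chosen - h_rejected, axis=-1)
        batch_mean = float(dist.mean())
        if self.ema is None:
            self.ema = batch_mean
        else:
            self.ema = self.momentum * self.ema + (1.0 - self.momentum) * batch_mean
        # Small-distance pairs get weights above 1, large-distance pairs below 1.
        # In an autograd framework these weights would be treated as constants.
        return self.ema / (dist + self.eps)


def bt_loss(reward_chosen, reward_rejected):
    # -log(sigmoid(x)) equals log(1 + exp(-x)); logaddexp keeps it stable.
    return np.logaddexp(0.0, -(reward_chosen - reward_rejected))


def normalized_bt_loss(reward_chosen, reward_rejected, weights):
    return float(np.mean(weights * bt_loss(reward_chosen, reward_rejected)))


# Toy batch: one small-distance pair and one large-distance pair.
h_chosen = np.array([[0.5, 0.0], [5.0, 0.0]])
h_rejected = np.zeros((2, 2))
reward_chosen = np.array([0.1, 2.0])
reward_rejected = np.array([0.3, 0.5])

normalizer = EmaDistanceNormalizer(momentum=0.9)
weights = normalizer.weights(h_chosen, h_rejected)
loss = normalized_bt_loss(reward_chosen, reward_rejected, weights)
print("weights:", weights, "loss:", loss)
```

Under these assumptions, the product of each weight and its pair distance is roughly the same for every pair in the batch, so the distance no longer decides which pairs dominate the update; the EMA keeps the overall scale from jumping between batches.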

Abstract

Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective used in reward modeling is the Bradley-Terry (BT) loss, which learns from pairwise data consisting of a chosen and a rejected response. In this work, we analyze the per-sample gradient of the BT loss and show that its norm scales with two distinct components: (1) the difference in predicted rewards between the chosen and rejected responses, which reflects the prediction error, and, critically, (2) the representation distance between the pair, measured in the output space of the final layer. While the first term captures the intended training signal, we show that the second term can significantly affect the update magnitude and misalign learning. Specifically, pairs with small representation distance often receive vanishingly weak updates, even when misranked, while pairs with large distance receive disproportionately strong updates. As a result, gradients from large-distance pairs overshadow those from small-distance pairs, where fine-grained distinctions are especially important. To overcome this limitation, we propose NormBT, an adaptive pair-wise normalization scheme that balances representation-driven effects and focuses the learning signal on prediction error. NormBT is a lightweight, drop-in modification of the BT loss with negligible overhead. Across various LLM backbones and datasets, NormBT improves reward model performance consistently, with notable gains of over 5% on the Reasoning category of RewardBench, which contains numerous small-distance pairs. This work reveals a key limitation in the widely used BT objective and provides a simple, effective correction.
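
The two-factor structure of the gradient norm can be checked numerically in a simplified setting. The sketch below assumes a linear reward head `r = w · h` on the final-layer representation and looks only at the gradient with respect to the head weights `w`, a narrower case than the full per-sample gradient analyzed in the paper. In that case the gradient of `-log(sigmoid(r_chosen - r_rejected))` is `-(1 - sigmoid(r_chosen - r_rejected)) * (h_chosen - h_rejected)`, so its norm is the prediction-error factor times the representation distance. The helper names `bt_head_gradient` and `make_pair` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def bt_head_gradient(w, h_chosen, h_rejected):
    """Gradient of -log(sigmoid(w.h_chosen - w.h_rejected)) with respect to w."""
    diff = h_chosen - h_rejected
    reward_gap = float(w @ diff)
    error = 1.0 - sigmoid(reward_gap)  # prediction-error factor
    distance = float(np.linalg.norm(diff))
    return -error * diff, error, distance


def make_pair(w, gap, spread, rng):
    """Build a pair whose reward gap is `gap` and whose distance grows with `spread`."""
    h_rejected = rng.normal(size=w.shape)
    ortho = rng.normal(size=w.shape)
    ortho -= (ortho @ w) / (w @ w) * w  # remove the component along w
    ortho /= np.linalg.norm(ortho)
    diff = gap * w / (w @ w) + spread * ortho  # w @ diff stays equal to gap
    return h_rejected + diff, h_rejected


rng = np.random.default_rng(0)
dim = 8
w = rng.normal(size=dim)

grad_norms = []
for spread in (0.1, 1.0, 10.0):
    # gap < 0 means the pair is misranked, so the error factor is large.
    h_c, h_r = make_pair(w, gap=-0.5, spread=spread, rng=rng)
    grad, error, distance = bt_head_gradient(w, h_c, h_r)
    grad_norm = float(np.linalg.norm(grad))
    assert np.isclose(grad_norm, error * distance)
    grad_norms.append(grad_norm)
    print(f"distance={distance:.3f}  error={error:.3f}  grad_norm={grad_norm:.3f}")
```

All three pairs are equally misranked, yet the update size grows with the distance between their representations, which is the effect the abstract describes for small-distance versus large-distance pairs.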

Paper Structure

This paper contains 36 sections, 13 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: An illustration in which one pair (top) receives an inherently large update due to its large representation distance, while another pair (bottom) receives a weak update due to its small distance.
  • Figure 2: Gradient Information Across Dataset. Gradient norms (thus update sizes) vary widely by task under the BT-loss, corresponding to the variation in representation distance on the right. In particular, Reasoning pairs exhibit the smallest distance and correspondingly the weakest updates.
  • Figure 3: Update sizes at fixed reward difference.
  • Figure 4: Comparison on RewardBench pairs where two models disagree. Pairs are binned by representation distance $\| h_w-h_l \|$ computed from gemma-2B-it. The largest gains for NormBT appear in the small-distance regime, consistent with our analysis that BT under-updates such pairs.
  • Figure 5: Best-of-N Selection. Higher gold score indicates higher quality of the selected responses. NormBT consistently outperforms all BT baselines, as well as BT counterparts in ablation studies.
  • ...and 6 more figures