Energy-Based Preference Model Offers Better Offline Alignment than the Bradley-Terry Preference Model

Yuzhong Hong, Hanshan Zhang, Junwei Bao, Hongfei Jiang, Yang Song

TL;DR

This work identifies a fundamental flaw in offline RLHF using Bradley-Terry-based formulations: the MLE can be non-unique in the infinite space of responses, preventing the required slope-1 relationship between learned log-ratio rewards and true rewards. To address this, the authors introduce the Infinite Preference Model (IPM), an Energy-Based Model with a guaranteed unique MLE, and develop Energy Preference Alignment (EPA), a practical contrastive loss that approximates the IPM MLE using offline data. Theoretical results tie the IPM MLE to the RLHF minimizer under slope-1 linearity, while an energy-discrepancy-based offline scheme enables tractable training. Empirically, EPA delivers state-of-the-art offline alignment on open benchmarks, outperforming DPO and related baselines, with a favorable KL-reward tradeoff and beneficial effects from combining strong and weak negatives. The work highlights the potential of EBMs for offline RLHF and points to future work on efficiency and loss-trick design to further close the gap with online RL methods.

Abstract

Since the debut of DPO, it has been shown that aligning a target LLM with human preferences via the KL-constrained RLHF loss is mathematically equivalent to a special kind of reward modeling task. Concretely, the task requires: 1) using the target LLM to parameterize the reward model, and 2) tuning the reward model so that it has a 1:1 linear relationship with the true reward. However, we identify a significant issue: the DPO loss might have multiple minimizers, of which only one satisfies the required linearity condition. The problem arises from a well-known issue of the underlying Bradley-Terry preference model: it does not always have a unique maximum likelihood estimator (MLE). Consequently, the minimizer of the RLHF loss might be unattainable because it is merely one among many minimizers of the DPO loss. As a better alternative, we propose an energy-based model (EBM) that always has a unique MLE, inherently satisfying the linearity requirement. To approximate the MLE in practice, we propose a contrastive loss named Energy Preference Alignment (EPA), wherein each positive sample is contrasted against one or more strong negatives as well as many free weak negatives. Theoretical properties of our EBM enable the approximation error of EPA to almost surely vanish when a sufficient number of negatives is used. Empirically, we demonstrate that EPA consistently delivers better performance on open benchmarks compared to DPO, thereby showing the superiority of our EBM.
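The non-uniqueness of the Bradley-Terry MLE can be illustrated on a toy case. The following is a minimal sketch and is not taken from the paper: the response names, the comparison counts, and the helper `bt_log_likelihood` are hypothetical. It relies only on the standard fact that the Bradley-Terry likelihood depends on reward differences within compared pairs, so responses that are never compared with each other can be shifted relative to one another without changing the likelihood.

```python
import math


def bt_log_likelihood(rewards, comparisons):
    """Bradley-Terry log-likelihood of (winner, loser) pairs under `rewards`."""
    total = 0.0
    for winner, loser in comparisons:
        diff = rewards[winner] - rewards[loser]
        # log sigmoid(diff) = -log(1 + exp(-diff))
        total -= math.log1p(math.exp(-diff))
    return total


# Hypothetical toy data: "a" is only ever compared with "b", "c" only with "d".
comparisons = [("a", "b"), ("a", "b"), ("b", "a"),
               ("c", "d"), ("c", "d"), ("d", "c")]

# One maximizer: a 2:1 win ratio gives a reward gap of log(2) within each pair.
original = {"a": math.log(2), "b": 0.0, "c": math.log(2), "d": 0.0}

# Shift the rewards of the second pair only; within-pair gaps are unchanged.
shifted = dict(original, c=original["c"] - 10.0, d=original["d"] - 10.0)

ll_original = bt_log_likelihood(original, comparisons)
ll_shifted = bt_log_likelihood(shifted, comparisons)

# The two log-likelihoods agree up to floating-point error ...
print(ll_original, ll_shifted)
# ... while the gap between "a" and "c" differs between the two solutions.
print(original["a"] - original["c"], shifted["a"] - shifted["c"])
```

Both reward assignments maximize the same likelihood, yet they disagree on the gap between "a" and "c". At most one of them can therefore match the true reward up to a single additive constant, which is the sense in which a slope-1 solution can be merely one among many minimizers.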

Paper Structure

This paper contains 40 sections, 9 theorems, 40 equations, 6 figures, 7 tables.

Key Result

Theorem 3.1

When the IPM is parameterized as follows, its MLE is guaranteed to exist and to be unique, and it is reached if and only if the slope-1 linearity (i.e., Eq. (3)) holds between the log-ratio reward and the true reward, where $r_{\theta}$ is defined as in Eq. (2).
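Eq. (2) and Eq. (3) are not reproduced in this summary. As a reading aid, the log-ratio reward and the slope-1 linearity are commonly written in the DPO literature as below. The symbols $\beta$, $\pi_{\text{ref}}$, and $C(x)$ are assumptions that follow the usual DPO convention and may differ from the paper's exact notation.

$$
r_{\theta}(x, y) = \beta \log \frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)},
\qquad
r_{\theta}(x, y) = r_{\text{true}}(x, y) + C(x).
$$

The second relation is what the minimizer of the KL-constrained RLHF loss satisfies. That minimizer has the form $\pi^{*}(y|x) \propto \pi_{\text{ref}}(y|x) \exp\left(r_{\text{true}}(x, y)/\beta\right)$, so that

$$
\beta \log \frac{\pi^{*}(y|x)}{\pi_{\text{ref}}(y|x)} = r_{\text{true}}(x, y) - \beta \log Z(x),
$$

where $Z(x)$ is the normalizing constant. In this form the offset is $C(x) = -\beta \log Z(x)$, which depends on $x$ only, and the slope with respect to $r_{\text{true}}$ is exactly 1.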

Figures (6)

  • Figure 1: Samples deviate from the slope-1 linearity (yellow lines) after training with DPO. Given an extremely undesirable $y_{weak}^{-}$ (i.e., one with a very small $r_{\text{true}}$), its $r_{\theta}$ must be correspondingly small to attain the linearity.
  • Figure 2: An illustration of the contributions of the paper. Our core argument is that an energy-based model (EBM) is a better alternative to the Bradley-Terry model (BTM) because its maximum likelihood estimator (MLE) is guaranteed to exist and to be unique (and is identical to the minimizer of the RLHF loss). The advantage of our EBM comes from its intrinsic consideration of the infinite size of the space of $y|x$, whereas BTM ignores issues caused by the pair sampling distribution ($p(y_w,y_l|x)$) in such an infinite space. Hence we name our EBM the Infinite Preference Model. Although approximating the MLE with our proposed EPA loss introduces inevitable error in practice, we find that it still empirically outperforms its counterpart, DPO, with or without the loss modification techniques presented in the previous offline alignment literature.
  • Figure 3: DPO vs. EPA (1:1:2) from the perspective of (a) KL-Reward frontier and (b) training dynamics.
  • Figure 4: Performance of modified DPO ($N^{-}_{weak}=0$) and modified EPA ($N^{-}_{weak}>0$) with a margin $m_c$ added to $r_{\theta}(y_l|x)$. Solid lines represent the length-controlled win-rates, and dotted lines represent the raw win-rates.
  • Figure 5: A simplified illustration of the topology of the functionals mentioned in the paper's theorems. The vertical axis represents the value of each functional. The horizontal axis represents the space of $r$.
  • ...and 1 more figure

Theorems & Definitions (18)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Lemma 1.1
  • proof
  • Definition 1.2
  • Theorem 1.3: Theorem of necessity
  • proof
  • Theorem 1.4: Theorem of sufficiency
  • proof
  • ...and 8 more