Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both

Abhijnan Nath, Changsoo Jung, Ethan Seefried, Nikhil Krishnaswamy

TL;DR

The paper tackles misalignment between rewards and human preferences in offline LLM alignment, especially under non-deterministic judgments. It introduces Direct Reward Distillation and policy-Optimization (DRDO), a simple, efficient, non-ensemble, reference-free framework that distills rewards from an Oracle into the policy while simultaneously learning diverse preferences via a novel contrastive log-unlikelihood objective with focal modulation. DRDO removes the reliance on a reference-model KL constraint. On the Ultrafeedback and TL;DR datasets, across model sizes, it improves expected rewards over DPO and e-DPO in both deterministic and non-deterministic settings, and it remains robust when preference signals are weak or noisy and when the data is out-of-distribution.

Abstract

Traditional RLHF-based LLM alignment methods explicitly maximize the expected rewards from a separate reward model. More recent supervised alignment methods like Direct Preference Optimization (DPO) circumvent this phase to avoid problems including model drift and reward overfitting. Although popular for their simplicity, DPO and similar direct alignment methods, which rely heavily on the Bradley-Terry-based pairwise preference formulation, can still lead to degenerate policies when challenged by non-deterministic or noisy preference labels, for example when humans score two candidate outputs with low confidence. This paper introduces DRDO (Direct Reward Distillation and policy-Optimization), which simultaneously models rewards and preferences to avoid such degeneracy. DRDO directly mimics rewards assigned by an oracle while learning human preferences with a novel preference likelihood formulation. Results on the Ultrafeedback and TL;DR datasets demonstrate that DRDO-trained policies surpass methods such as DPO and e-DPO in terms of expected rewards and are more robust, on average, to noisy preference signals as well as out-of-distribution (OOD) settings.
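
To make the degeneracy concern concrete, here is a minimal sketch of the Bradley-Terry pairwise formulation the abstract refers to. It is not code from the paper: the function names (`bt_preference`, `bt_nll`) and the margin values are illustrative assumptions. It shows that with a hard label the Bradley-Terry negative log-likelihood keeps shrinking as the reward margin grows, whereas with a non-deterministic label of 0.5 the loss is smallest at a margin of zero.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bt_preference(r_first: float, r_second: float) -> float:
    """Bradley-Terry probability that the first response is preferred:
    sigmoid(r_first - r_second)."""
    return math.exp(-softplus(-(r_first - r_second)))


def bt_nll(margin: float, p_label: float) -> float:
    """Cross-entropy between a (possibly soft) preference label and the
    Bradley-Terry probability at the given reward margin.

    -log sigmoid(m) = softplus(-m) and -log(1 - sigmoid(m)) = softplus(m).
    """
    return p_label * softplus(-margin) + (1.0 - p_label) * softplus(margin)


# Hypothetical reward margins, chosen only for illustration.
margins = [0.0, 1.0, 5.0, 10.0]

# Hard label (p = 1): the loss keeps decreasing, so nothing stops the
# margin from growing without bound.
hard_losses = [bt_nll(m, 1.0) for m in margins]

# Non-deterministic label (p = 0.5): the loss is minimized at margin 0,
# so pushing the two responses apart is penalized.
soft_losses = [bt_nll(m, 0.5) for m in margins]

if __name__ == "__main__":
    for m, h, s in zip(margins, hard_losses, soft_losses):
        print(f"margin={m:5.1f}  hard-label NLL={h:.5f}  soft-label NLL={s:.5f}")
```

Training on hard labels alone therefore rewards ever larger margins even for pairs that annotators cannot reliably separate, which is the kind of setting the paper targets.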

Paper Structure

This paper contains 44 sections, 4 theorems, 37 equations, 9 figures, 8 tables.

Key Result

Proposition 1

Consider a preference model $\mathcal{P}$ that cannot be perfectly captured by a Bradley-Terry (BT) reward model. This is illustrated in proof:reward_preference_divergence. Specifically, let $\mathcal{P}_{BT}^\pi(y\succ y')\coloneqq \sigma(s^\pi(y) - s^\pi(y'))$ be the BT preference model correspond

Figures (9)

  • Figure 1: Unlike popular supervised preference alignment algorithms such as Direct Preference Optimization (DPO; rafailov2024direct), which learn rewards implicitly, DRDO directly optimizes for explicit rewards from an Oracle while simultaneously learning diverse kinds of preference signals during alignment. Optimized with a simple regression loss on the difference of rewards assigned by the Oracle, together with a focal-log-unlikelihood component (see Sec. \ref{sec:drdo}), DRDO avoids DPO's particular difficulty in learning non-deterministic preference pairs, thereby bridging the gap between the preference distribution estimated from the data and the true preference distribution $p^*$. Additionally, DRDO does not require an additional reference model during training and can leverage reward signals even when preference labels are not directly accessible.
  • Figure 2: Average Ultrafeedback win-rates computed with DRDO's Oracle reward model against all baselines (SFT, DPO, and e-DPO) at various diversity sampling temperatures ($T$).
  • Figure 3: Oracle expected reward advantage on CNN/Daily Mail articles.
  • Figure 4: Illustration of the DRDO preference loss as a function of the log-unlikelihood ratio across various values of $\gamma$, the focal modulation parameter.
  • Figure 5: Top: DRDO performance evolution during OPT 1.3B training compared to DPO and e-DPO on the evaluation set of Ultrafeedback (cui2024ultrafeedbackboostinglanguagemodels), using randomly sampled generations to compute the reward advantage against the preferred reference generations.
  • ...and 4 more figures
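
The captions for Figures 1 and 4 describe two ingredients: a regression loss on the difference of Oracle rewards, and a preference term modulated by a focal parameter $\gamma$. The sketch below illustrates both under loud assumptions. The paper's exact objective is not reproduced on this page, so the preference term here uses the standard focal-loss form $-\alpha(1-p_w)^\gamma \log p_w$ with $p_w = \sigma(\log\pi(y_w) - \log\pi(y_l))$, and every name and default value (`reward_distillation_loss`, `focal_preference_loss`, `alpha`, `gamma`) is hypothetical rather than taken from the authors.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def reward_distillation_loss(oracle_w: float, oracle_l: float,
                             student_w: float, student_l: float) -> float:
    """Squared error between the Oracle's reward difference and the
    student's predicted reward difference for one preference pair."""
    return ((oracle_w - oracle_l) - (student_w - student_l)) ** 2


def focal_preference_loss(logp_w: float, logp_l: float,
                          gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal-modulated log-loss on the preferred response (assumed form).

    p_w = sigmoid(logp_w - logp_l); the factor (1 - p_w)**gamma shrinks
    the loss on pairs the policy already ranks confidently.
    """
    z = logp_w - logp_l
    log_p_w = -softplus(-z)      # log sigmoid(z)
    p_w = math.exp(log_p_w)
    return -alpha * (1.0 - p_w) ** gamma * log_p_w


if __name__ == "__main__":
    # Illustrative numbers only: log-probability gaps and gamma values.
    for gamma in (0.0, 1.0, 2.0, 5.0):
        easy = focal_preference_loss(-1.0, -4.0, gamma=gamma)  # gap of +3
        hard = focal_preference_loss(-2.0, -2.0, gamma=gamma)  # gap of 0
        print(f"gamma={gamma}: easy pair={easy:.6f}, ambiguous pair={hard:.6f}")
```

With `gamma=0` the preference term reduces to the plain negative log-likelihood. Larger values down-weight pairs the policy already separates well, so the remaining gradient signal concentrates on ambiguous pairs.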

Theorems & Definitions (9)

  • Proposition 1: Sampling Distribution Dependence in the Induced Bradley-Terry (BT) Model
  • proof
  • Proposition 2: Non-Deterministic Preferences
  • Lemma 1
  • Lemma 2
  • proof
  • proof
  • proof
  • proof