Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both
Abhijnan Nath, Changsoo Jung, Ethan Seefried, Nikhil Krishnaswamy
TL;DR
The paper addresses the mismatch between rewards and human preferences in offline LLM alignment, particularly when preference judgments are non-deterministic. It introduces Direct Reward Distillation and policy-Optimization (DRDO), a simple, efficient, non-ensemble, reference-free framework that distills rewards from an Oracle into the policy while simultaneously learning diverse preferences through a novel contrastive log-unlikelihood objective with focal modulation. DRDO removes the reliance on a reference-model KL constraint. On the Ultrafeedback and TL;DR datasets and across model sizes, it achieves higher expected rewards than DPO and e-DPO in both deterministic and non-deterministic settings, and it remains robust to weak or noisy preference signals and to out-of-distribution data. The result is a principled, computationally efficient route to joint reward distillation and preference learning for imperfect, real-world preference data.
Abstract
Traditional RLHF-based LLM alignment methods explicitly maximize the expected rewards from a separate reward model. More recent supervised alignment methods like Direct Preference Optimization (DPO) circumvent this phase to avoid problems including model drift and reward overfitting. Although popular for their simplicity, DPO and similar direct alignment methods, which rely heavily on the Bradley-Terry pairwise preference formulation, can still lead to degenerate policies when challenged by non-deterministic or noisy preference labels, for example, when humans score two candidate outputs with low confidence. This paper introduces DRDO (Direct Reward Distillation and policy-Optimization), which simultaneously models rewards and preferences to avoid such degeneracy. DRDO directly mimics rewards assigned by an oracle while learning human preferences with a novel preference likelihood formulation. Results on the Ultrafeedback and TL;DR datasets demonstrate that DRDO-trained policies surpass methods such as DPO and e-DPO in terms of expected rewards and are more robust, on average, to noisy preference signals as well as out-of-distribution (OOD) settings.
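To make the degeneracy concern concrete, the following is a minimal sketch of the standard Bradley-Terry pairwise formulation, not the DRDO objective itself. The function names and the scalar `margin` (standing in for the reward difference between the two responses) are illustrative assumptions and do not come from the paper. It contrasts a hard label ("the first response always wins") with a soft label of 0.5, which models a non-deterministic judgment.

```python
import math

def bt_prob(reward_w: float, reward_l: float) -> float:
    """Bradley-Terry probability that the first response is preferred."""
    return 1.0 / (1.0 + math.exp(-(reward_w - reward_l)))

def hard_label_loss(margin: float) -> float:
    """Negative log-likelihood when the label says the first response always wins.

    Equals -log(sigmoid(margin)); it keeps shrinking as margin grows.
    """
    return math.log1p(math.exp(-margin))

def soft_label_loss(margin: float, p: float) -> float:
    """Cross-entropy against a preference probability p (assumed known here)."""
    return p * math.log1p(math.exp(-margin)) + (1.0 - p) * math.log1p(math.exp(margin))

# Illustrative margins only; these are not values from the paper.
for m in (0.0, 2.0, 8.0):
    print(f"margin={m:4.1f}  hard={hard_label_loss(m):.4f}  soft(p=0.5)={soft_label_loss(m, 0.5):.4f}")
```

Under the hard label, the loss has no finite minimizer, so training is rewarded for pushing the margin ever larger even when annotators are effectively undecided. Under the soft label with p = 0.5, the loss is smallest at a margin of zero, which is the behavior one would want for a genuinely uncertain pair.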
