Adaptive Margin RLHF via Preference over Preferences

Yaswanth Chittepu, Prasann Singhal, Greg Durrett, Scott Niekum

TL;DR

The paper introduces DPO-PoP, a margin-aware alignment framework for RLHF that uses Preference-over-Preference annotations to infer adaptive margins per datapoint, bypassing the need for precise scalar ratings. By extending Direct Preference Optimization to incorporate PoP signals, the method improves both discriminative alignment and generative quality, with two sampling strategies (iterative and random) that balance these goals. Empirical results on UltraFeedback, RewardBench, and AlpacaEval 2.0 show that PoP-based margins can outperform fixed or ground-truth margins, though a tradeoff exists between classification accuracy and generation. Practitioners can choose the PoP sampling strategy to emphasize domain-specific discrimination or broader generative robustness, offering a practical path to better RLHF alignment with ordinal supervision.

Abstract

Margin-based optimization is fundamental to improving generalization and robustness in classification tasks. In the context of reward model learning from preferences within Reinforcement Learning from Human Feedback (RLHF), existing methods typically rely on no margins, fixed margins, or margins that are simplistic functions of preference ratings. However, such formulations often fail to account for the varying strengths of different preferences (for example, some preferences are associated with larger margins between responses), or they rely on noisy margin information derived from ratings. We argue that modeling the strength of preferences can lead to better generalization and more faithful alignment. Furthermore, many existing methods that use adaptive margins assume access to accurate preference scores, which can be difficult for humans to provide reliably. We propose an approach that leverages preferences over preferences, that is, annotations indicating which of two preferences reflects a stronger distinction. We use this ordinal signal to infer adaptive margins on a per-datapoint basis. We introduce an extension to Direct Preference Optimization (DPO), DPO-PoP, that incorporates adaptive margins from preference-over-preference supervision, enabling improved discriminative and generative performance. Empirically, our method outperforms vanilla DPO, DPO with fixed margins, and DPO with ground-truth margins on the UltraFeedback dataset. Additionally, we show that there is a tradeoff between discriminative and generative performance: improving test classification accuracy, particularly by correctly labeling weaker preferences at the expense of stronger ones, can lead to a decline in generative quality. To navigate this tradeoff, we propose two sampling strategies to gather preference-over-preference labels: one favoring discriminative performance and one favoring generative performance.
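
To make the adaptive-margin idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the standard DPO implicit reward (a scaled log-ratio between policy and reference) and, following the Figure 1 caption, uses the reward difference of the weaker preference as the margin for the stronger one inside a logistic loss. The function names, the `beta` default, and the exact loss form are illustrative assumptions; the paper may differ in details such as whether gradients flow through the margin.

```python
import math


def implicit_reward_gap(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style implicit reward difference between chosen (w) and rejected (l).

    Inputs are sequence log-probabilities under the policy and the reference
    model. beta=0.1 is an illustrative default, not a value from the paper.
    """
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def margin_logistic_loss(gap, margin=0.0):
    """Logistic preference loss with a margin; margin=0 recovers vanilla DPO."""
    return -log_sigmoid(gap - margin)


def pop_loss(strong_gap, weak_gap):
    """Assumed PoP form: the weaker preference's gap acts as the margin
    for the stronger preference (per the Figure 1 caption)."""
    return margin_logistic_loss(strong_gap, margin=weak_gap)


# Hypothetical numbers: the pair labelled "stronger" has the larger gap,
# so the loss is lower than when the ordering is violated.
consistent = pop_loss(strong_gap=2.0, weak_gap=0.5)
violated = pop_loss(strong_gap=0.5, weak_gap=2.0)
```

In this sketch the margin is per-datapoint and comes only from an ordinal comparison between two preferences, so no scalar rating is needed at training time.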

Paper Structure

This paper contains 58 sections, 5 theorems, 62 equations, 5 figures, 20 tables.

Key Result

Theorem 1

Assume equation eq:radius-assumption and equation eq:norm-assumption, and let $\delta \in (0,1)$. Then with probability at least $1-\delta$ over the sample $S\sim\mathcal{D}^n$, we have simultaneously for all $w$ with $\|w\|_2 \le \Lambda$, In particular, the left-hand side depends only on the test score $w^\top \Psi$ and does not require access to $M$ at test time; the adaptive margins $m_i$ app

Figures (5)

  • Figure 1: A pictorial illustration of the PoP framework. A preference is stronger than another when the reward difference between its preferred and dispreferred responses is larger. The reward difference of the weaker preference in the pair serves as the margin for the stronger preference.
  • Figure 2: Cumulative Accuracy vs Margin for the different DPO variants considered. Lower Cumulative Accuracy at margin $m$ indicates the accuracy of predicting preference labels using only datapoints with ground-truth margin less than or equal to $m$. Conversely, Upper Cumulative Accuracy reflects prediction accuracy on datapoints with ground-truth margin greater than or equal to $m$. The dark grey histogram shows the distribution (density) of margin values in the test set. In plot (a), DPO-PoP-Iter achieves higher accuracy on datapoints with lower margins, while in plot (b), its performance drops for higher margin datapoints.
  • Figure 3: Spearman and Pearson correlations (left), and test classification accuracy (right) of DPO-PoP models trained with varying levels of label noise.
  • Figure 4: Win rates (left) and median advantage (right) of DPO-PoP models trained with varying levels of label noise.
  • Figure 5: Training curves for test classification accuracy, UltraRM-winrate, and KL with respect to the reference policy.
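
The lower and upper cumulative accuracies defined in the Figure 2 caption can be sketched as follows. This is an illustrative helper, not code from the paper; the function name and the toy inputs are assumptions. `margins` holds the ground-truth margin of each test datapoint and `correct` holds 1 if the model predicted the preference label correctly, else 0.

```python
def cumulative_accuracy(margins, correct, m):
    """Return (lower, upper) cumulative accuracy at threshold m.

    lower: accuracy over datapoints with ground-truth margin <= m
    upper: accuracy over datapoints with ground-truth margin >= m
    Either value is NaN when no datapoint falls in its subset.
    """
    lower = [c for g, c in zip(margins, correct) if g <= m]
    upper = [c for g, c in zip(margins, correct) if g >= m]

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(lower), mean(upper)


# Toy data (hypothetical): four test pairs with their margins and outcomes.
margins = [0.5, 1.0, 2.0, 3.0]
correct = [1, 0, 1, 1]
lower_acc, upper_acc = cumulative_accuracy(margins, correct, m=1.0)
```

Sweeping `m` over the range of margins gives the two curves; a model that does well on weak preferences shows a high lower curve at small `m`, while a drop on strong preferences shows up in the upper curve at large `m`.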

Theorems & Definitions (7)

  • Theorem 1: Adaptive-margin logistic generalization bound
  • Lemma 1
  • proof
  • Lemma 2: Uniform deviation for bounded losses
  • Lemma 3: Per-example contraction
  • Lemma 4: Ramp vs. logistic
  • proof