When Human Preferences Flip: An Instance-Dependent Robust Loss for RLHF

Yifan Xu, Xichen Ye, Yifan Chen, Qiaosheng Zhang

TL;DR

This work tackles the problem of instance-dependent preference flipping in RLHF data, which can severely degrade alignment quality. It introduces Flipping-Aware Direct Preference Optimization (FA-DPO), a post-training approach that explicitly models per-sample flipping probabilities via a feature-rich, instance-aware module, and integrates this into the DPO objective. The paper provides theoretical guarantees for consistency and convergence, and demonstrates through extensive experiments on UltraFeedback and HH_Golden that FA-DPO consistently outperforms vanilla DPO and other robust baselines across multiple LLM backbones. The approach achieves robust alignment with reduced sensitivity to noisy preferences while incurring comparable computational costs to standard DPO, making it practically impactful for scalable LLM alignment under realistic annotation noise.
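One way to picture how a per-sample flipping probability could enter the DPO objective is a forward-corrected likelihood: the model's preference probability is mixed with its complement before taking the log. The sketch below is an illustration under that assumption, not the paper's exact loss. The names `flip_probability` and `flip_aware_loss`, the logistic form of the flip module, and the cap of 0.5 on the flip probability are all hypothetical choices made here.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO implicit-reward margin between the labeled winner and loser."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def flip_probability(features, weights, bias=0.0):
    """Hypothetical flip module: logistic score on features, scaled to at most 0.5."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 0.5 * sigmoid(score)


def flip_aware_loss(margin, eps):
    """Negative log-likelihood of the observed label when it may have been flipped."""
    p = sigmoid(margin)  # model probability that the labeled winner is truly better
    observed = (1.0 - eps) * p + eps * (1.0 - p)  # probability of the recorded label
    return -math.log(max(observed, 1e-12))  # clamp is an assumption to avoid log(0)
```

With `eps = 0` this reduces to the vanilla DPO loss. With `eps > 0`, a pair the model strongly disagrees with is penalized less, which is the intuition behind reduced sensitivity to flipped pairs.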

Abstract

Dataset quality plays an important role in large language model (LLM) alignment. In collecting human feedback, however, preference flipping is ubiquitous and corrupts data annotation; this calls for alignment algorithms that are more robust to potentially flipped pairs. To this end, this paper introduces a Flipping-Aware Direct Preference Optimization (FA-DPO) algorithm tailored to preference flipping from a reinforcement learning from human feedback (RLHF) perspective. We treat the inherent human intention model and the preference flipping mechanism introduced by external factors as two distinct stages; in the latter, we introduce an instance-dependent flipping probability on top of the Bradley-Terry (BT) model. Further, by leveraging features relevant to preference annotation, we capture uncertainty in judgments and model preference flipping patterns. In practice, we design a simple yet efficient iterative optimization algorithm compatible with the original RLHF and DPO algorithms. In our experiments, we evaluate the proposed method and other baseline methods under the instance-dependent preference flipping model in multiple settings.
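The two-stage view in the abstract can be simulated in a few lines: first an intended preference is drawn from the BT model, then the recorded label is flipped with an instance-specific probability. This is a toy sketch only; the function names, the scalar rewards, and the fixed `eps` passed per pair are assumptions for illustration, not the paper's data pipeline.

```python
import math
import random


def bt_preference(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is truly preferred over B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def sample_label(reward_a, reward_b, eps, rng):
    """Draw one recorded label; returns (a_recorded_as_winner, was_flipped)."""
    # Stage 1: the annotator's intention follows the BT model.
    a_wins = rng.random() < bt_preference(reward_a, reward_b)
    # Stage 2: an external factor flips the recorded label with probability eps.
    flipped = rng.random() < eps
    return (a_wins != flipped), flipped


def recorded_win_rate(reward_a, reward_b, eps, n=20000, seed=0):
    """Empirical fraction of draws in which A is recorded as the winner."""
    rng = random.Random(seed)
    wins = sum(sample_label(reward_a, reward_b, eps, rng)[0] for _ in range(n))
    return wins / n
```

Comparing `recorded_win_rate(1.0, 0.0, eps)` for a few values of `eps` shows the recorded win rate drifting toward 0.5 as flipping becomes more likely, which is why untreated flips weaken the training signal.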

Paper Structure

This paper contains 42 sections, 4 theorems, 50 equations, 6 figures, 4 tables, 1 algorithm.

Key Result

Proposition 4.1

For any input $\tilde{\bm{x}}$, the corrupted preference probability, under the instance-dependent preference flipping setting, relates to the true preference likelihood via: where $\varepsilon_{\tilde{\bm{x}}}$ represents the instance-specific flipping probability for triplet $\tilde{\bm{x}}= (x,\tilde{y}_w,\tilde{y}_l)$, and $p$ denotes the true likelihood $\mathbb{P}\{\tilde{y}_w\succ \tilde{y
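Read as a two-stage process (a true preference followed by an independent flip), a relation of this kind can be sketched as below. The symbol $\tilde{p}$ for the corrupted preference probability is introduced here for illustration and may differ from the paper's notation; $p$ and $\varepsilon_{\tilde{\bm{x}}}$ are as defined in the statement.

```latex
\tilde{p}
  \;=\; (1-\varepsilon_{\tilde{\bm{x}}})\,p \;+\; \varepsilon_{\tilde{\bm{x}}}\,(1-p)
  \;=\; \varepsilon_{\tilde{\bm{x}}} \;+\; (1-2\,\varepsilon_{\tilde{\bm{x}}})\,p
```

The first term covers the case where the true preference is recorded unchanged, the second the case where the opposite preference was intended and then flipped. In this form, $\tilde{p}$ stays increasing in $p$ as long as $\varepsilon_{\tilde{\bm{x}}} < 1/2$.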

Figures (6)

  • Figure 1: Characterization of learned preference flipping distribution. (a) Correlation between actual and predicted noise probabilities with regression line; (b) Predicted flipping distributions separated by flipping status; (c) Pattern of predicted flipping distribution with length-based features.
  • Figure : (a) Flipping Model Characterization at Flip Ratio 0.1
  • Figure : (a) Flipping Model Characterization at Flip Ratio 0.1
  • Figure : (b) Flipping Model Characterization at Flip Ratio 0.2
  • Figure : (c) Flipping Model Characterization at Flip Ratio 0.3
  • ...and 1 more figure

Theorems & Definitions (7)

  • Proposition 4.1: Instance-dependent preference flipping
  • Lemma 4.2: Gradient weight coefficient
  • Theorem 4.3: Consistency of $\bm{p}_\theta$
  • Theorem 4.5: Linear Convergence of $\hat{\omega}$
  • proof
  • proof
  • proof