When Human Preferences Flip: An Instance-Dependent Robust Loss for RLHF

Yifan Xu; Xichen Ye; Yifan Chen; Qiaosheng Zhang

TL;DR

This work tackles the problem of instance-dependent preference flipping in RLHF data, which can severely degrade alignment quality. It introduces Flipping-Aware Direct Preference Optimization (FA-DPO), a post-training approach that explicitly models per-sample flipping probabilities via a feature-rich, instance-aware module, and integrates this into the DPO objective. The paper provides theoretical guarantees for consistency and convergence, and demonstrates through extensive experiments on UltraFeedback and HH_Golden that FA-DPO consistently outperforms vanilla DPO and other robust baselines across multiple LLM backbones. The approach achieves robust alignment with reduced sensitivity to noisy preferences while incurring comparable computational costs to standard DPO, making it practically impactful for scalable LLM alignment under realistic annotation noise.

Abstract

Dataset quality plays an important role in large language model (LLM) alignment. When human feedback is collected, however, preference flipping is ubiquitous and corrupts the annotations; alignment algorithms therefore need improved robustness against potentially flipped pairs. To this end, this paper introduces Flipping-Aware Direct Preference Optimization (FA-DPO), an algorithm tailored to preference flipping from a reinforcement learning from human feedback (RLHF) perspective. We separate the inherent human intention model from the preference flipping mechanism introduced by external factors and treat them as two distinct stages; in the latter, we introduce an instance-dependent flipping probability on top of the Bradley-Terry (BT) model. Further, by leveraging features relevant to preference annotation, we capture uncertainty in judgments and model preference flipping patterns. In practice, we design a simple yet efficient iterative optimization algorithm compatible with the original RLHF and DPO algorithms. In our experiments, we evaluate the proposed method and other baseline methods under the instance-dependent preference flipping model in multiple settings.
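
The two-stage view described in the abstract can be illustrated with a small sketch. It is not the paper's exact FA-DPO objective, which the abstract does not spell out; it assumes the common noisy-label form in which the observed label agrees with the Bradley-Terry preference with probability 1 - eps(x) and is flipped with probability eps(x). The function names (`dpo_margin`, `flip_aware_nll`), the `beta` value, and the idea of passing `flip_prob` as a plain number (rather than the output of a learned, feature-based module) are all assumptions made for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dpo_margin(logp_w: float, logp_l: float,
               ref_logp_w: float, ref_logp_l: float,
               beta: float = 0.1) -> float:
    """Implicit reward margin used by standard DPO.

    logp_* are policy log-probabilities of the labeled-preferred (w) and
    labeled-rejected (l) responses; ref_logp_* are the same quantities
    under the frozen reference model. beta=0.1 is an illustrative default.
    """
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def flip_aware_nll(margin: float, flip_prob: float) -> float:
    """Negative log-likelihood of the observed label under a flip model.

    Stage 1 (intention): Bradley-Terry gives P(w preferred) = sigmoid(margin).
    Stage 2 (flipping): the label is flipped with probability flip_prob,
    which in an instance-dependent model would vary per sample.
    """
    p_clean = sigmoid(margin)
    p_observed = (1.0 - flip_prob) * p_clean + flip_prob * (1.0 - p_clean)
    return -math.log(max(p_observed, 1e-12))  # clamp to avoid log(0)


if __name__ == "__main__":
    m = dpo_margin(logp_w=-12.0, logp_l=-10.0, ref_logp_w=-11.0, ref_logp_l=-11.0)
    print("margin:", m)
    print("vanilla DPO loss:", flip_aware_nll(m, 0.0))
    print("loss with flip_prob=0.3:", flip_aware_nll(m, 0.3))
```

With `flip_prob = 0` the expression reduces to the standard DPO loss. As `flip_prob` grows, a pair that the model strongly disagrees with is penalized less, which is the intuition behind reduced sensitivity to flipped pairs.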
