Provably Robust DPO: Aligning Language Models with Noisy Feedback

Sayak Ray Chowdhury; Anush Kini; Nagarajan Natarajan

TL;DR

This paper tackles the problem of learning language model policies from noisy human preferences within the DPO framework. It introduces robust DPO (rDPO), an unbiased loss that corrects for random label flips, and derives gradient updates that de-bias noise. The authors provide theoretical guarantees showing estimation-error and sub-optimality bounds that degrade gracefully with noise, and they validate the approach with experiments on sentiment generation and dialogue data, where rDPO outperforms DPO and heuristic methods under substantial noise. The framework also generalizes to other preference models and to reward training in RLHF, suggesting broad applicability to robust preference-based learning pipelines.
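The de-biasing idea can be illustrated with a minimal sketch. It assumes the standard unbiased-estimator construction from the noisy-label literature applied to the logistic (BTL) loss on a preference margin; the paper's exact rDPO definition is not reproduced in this summary, so the function names and the `margin` argument here are hypothetical. The sketch checks numerically that averaging the corrected loss over random flips recovers the clean loss.

```python
import math


def dpo_loss(margin: float) -> float:
    """Logistic (BTL) loss -log(sigmoid(margin)) on a preference margin.

    `margin` stands in for the scaled log-ratio difference between the
    preferred and rejected responses (an assumption of this sketch).
    """
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def robust_loss(margin: float, eps: float) -> float:
    """Noise-corrected loss for an observed, possibly flipped, pair.

    Weighted difference of the loss on the observed ordering and on the
    reversed ordering, rescaled by 1 / (1 - 2 * eps).
    """
    if not 0.0 <= eps < 0.5:
        raise ValueError("flip rate must lie in [0, 1/2)")
    observed = dpo_loss(margin)    # loss if the observed label is trusted
    reversed_ = dpo_loss(-margin)  # loss if the observed label is swapped
    return ((1 - eps) * observed - eps * reversed_) / (1 - 2 * eps)


def expected_robust_loss(clean_margin: float, eps: float) -> float:
    """Average of robust_loss over the flip noise.

    The label is kept with probability 1 - eps (margin unchanged) and
    flipped with probability eps (margin negated).
    """
    kept = robust_loss(clean_margin, eps)
    flipped = robust_loss(-clean_margin, eps)
    return (1 - eps) * kept + eps * flipped
```

With `eps = 0` the corrected loss reduces to the plain logistic loss, and for any `eps < 1/2` the expectation over flips equals the clean loss. The price is a larger variance as `eps` approaches 1/2, because of the `1 / (1 - 2 * eps)` rescaling.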

Abstract

Learning from preference-based feedback has recently gained traction as a promising approach to align language models with human interests. While these aligned generative models have demonstrated impressive capabilities across various tasks, their dependence on high-quality human preference data poses a bottleneck in practical applications. Specifically, noisy (incorrect and ambiguous) preference pairs in the dataset might restrict the language models from capturing human intent accurately. While practitioners have recently proposed heuristics to mitigate the effect of noisy preferences, a complete theoretical understanding of their workings remains elusive. In this work, we aim to bridge this gap by introducing a general framework for policy optimization in the presence of random preference flips. We focus on the direct preference optimization (DPO) algorithm in particular since it assumes that preferences adhere to the Bradley-Terry-Luce (BTL) model, raising concerns about the impact of noisy data on the learned policy. We design a novel loss function, which de-biases the effect of noise on average, making a policy trained by minimizing that loss robust to the noise. Under log-linear parameterization of the policy class and assuming good feature coverage of the SFT policy, we prove that the sub-optimality gap of the proposed robust DPO (rDPO) policy compared to the optimal policy is of the order $O\left(\frac{1}{1-2\varepsilon}\sqrt{\frac{d}{n}}\right)$, where $\varepsilon < 1/2$ is the flip rate of labels, $d$ is the policy parameter dimension and $n$ is the size of the dataset. Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is robust to noise in preference labels compared to vanilla DPO and other heuristics proposed by practitioners.
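To get a feel for how the stated rate behaves, the following sketch evaluates the expression $\frac{1}{1-2\varepsilon}\sqrt{\frac{d}{n}}$ with all hidden constants dropped. Because the big-O notation hides constants, the outputs are only meaningful relative to each other; the function names are hypothetical and not taken from the paper.

```python
import math


def noise_factor(eps: float) -> float:
    """Multiplicative blow-up 1 / (1 - 2 * eps) caused by flip rate eps."""
    if not 0.0 <= eps < 0.5:
        raise ValueError("flip rate must lie in [0, 1/2)")
    return 1.0 / (1.0 - 2.0 * eps)


def bound_rate(d: int, n: int, eps: float) -> float:
    """Rate expression from the abstract with constants omitted."""
    return noise_factor(eps) * math.sqrt(d / n)


def sample_inflation(eps: float) -> float:
    """Factor by which n must grow to match the noise-free rate.

    Solving noise_factor(eps) * sqrt(d / n_noisy) == sqrt(d / n_clean)
    gives n_noisy = n_clean * noise_factor(eps) ** 2.
    """
    return noise_factor(eps) ** 2


if __name__ == "__main__":
    for eps in (0.0, 0.1, 0.25, 0.4):
        print(eps, bound_rate(d=100, n=10_000, eps=eps), sample_inflation(eps))
```

Under this reading, a flip rate of 0.25 doubles the rate expression, so roughly four times as many preference pairs would be needed to reach the same value as in the noise-free case.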

Paper Structure (19 sections, 6 theorems, 84 equations, 1 figure, 5 tables)

Introduction
Related Work
Background and Problem Setup
Our Approach: Robust DPO
An Unbiased Loss Function
Gradients of rDPO Loss
Theoretical Analysis
Estimation Error
Performance Bounds of Learned Policy
Generalizations and Extensions
Experiments
Missing Details
Proof of Lemma \ref{lem:robust_loss}
Variance of rDPO loss
Proof of Lemma \ref{ref:grads}
...and 4 more sections

Key Result

Lemma 3.1

For any $\theta,\theta_0\!\in\! \mathbb{R}^d$, $\varepsilon \!\in\! (0,1/2)$, we have

Figures (1)

Figure 1: Mean reward on IMDb dataset at different sampling temperatures after 1000 steps.

Theorems & Definitions (6)

Lemma 3.1
Lemma 3.2: Gradient weights
Theorem 4.2: Estimation error of $\widehat{\theta}_n$
Theorem 4.4: Sub-optimality gap of $\widehat{\pi}_n$
Lemma 4.5: Margin gap
Lemma 1.1
