Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization

Junkang Wu; Yuexiang Xie; Zhengyi Yang; Jiancan Wu; Jiawei Chen; Jinyang Gao; Bolin Ding; Xiang Wang; Xiangnan He

TL;DR

The paper analyzes noise in Direct Preference Optimization (DPO) for aligning LLMs and frames robustness within Distributionally Robust Optimization (DRO). It shows that DPO implicitly implements pointwise DRO and derives the relationship between the regularization parameter $\beta$ and the DRO radius $\eta$, namely $\beta^*(\eta)=\sqrt{\mathrm{Var}_{\pi_{\text{ref}}}[r(x,y)]/(2\eta)}$, interpreting $\beta$ as a noise reflector. To address pairwise noise, it introduces Distributionally Robustifying DPO (Dr. DPO) with a new hyperparameter $\beta'$ that governs worst-case pairwise weighting, achieving robust performance with a minimal code change. Empirically, Dr. DPO improves text quality and response accuracy in both noisy and clean settings on IMDB, Anthropic HH, and MT-Bench, outperforming DPO and several baselines. The work provides practical guidance on tuning $\beta'$ and demonstrates how pairwise robustness can be integrated into existing DPO workflows for more reliable LLM alignment on noisy real-world data.
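The relationship $\beta^*(\eta)=\sqrt{\mathrm{Var}_{\pi_{\text{ref}}}[r(x,y)]/(2\eta)}$ can be evaluated numerically. The following is a minimal sketch, not the authors' code: the function name `optimal_beta` and the reward samples are hypothetical, and the variance under $\pi_{\text{ref}}$ is assumed to be estimated from a finite sample of rewards.

```python
import math
from statistics import pvariance


def optimal_beta(rewards, eta):
    """Evaluate beta*(eta) = sqrt(Var[r] / (2 * eta)).

    `rewards` is a hypothetical sample of r(x, y) for responses drawn
    from the reference policy; its population variance stands in for
    Var_{pi_ref}[r(x, y)]. `eta` is the DRO radius and must be positive.
    """
    if eta <= 0:
        raise ValueError("eta must be positive")
    return math.sqrt(pvariance(rewards) / (2.0 * eta))


# Illustrative reward samples only (variance = 4.0), not data from the paper.
rewards = [0.0, 4.0, 0.0, 4.0]

for eta in (0.5, 2.0, 8.0):
    # A larger radius eta (more assumed noise) gives a smaller beta.
    print(f"eta={eta:>4}: beta*={optimal_beta(rewards, eta):.3f}")
```

Under this formula, $\beta^*$ shrinks as the radius $\eta$ grows, which is one way to read the description of $\beta$ as a noise reflector.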

Abstract

This study addresses the challenge of noise in training datasets for Direct Preference Optimization (DPO), a method for aligning Large Language Models (LLMs) with human preferences. We categorize noise into pointwise noise, which includes low-quality data points, and pairwise noise, which encompasses erroneous data pair associations that affect preference rankings. Utilizing Distributionally Robust Optimization (DRO), we enhance DPO's resilience to these types of noise. Our theoretical insights reveal that DPO inherently embeds DRO principles, conferring robustness to pointwise noise, with the regularization coefficient $\beta$ playing a critical role in its noise resistance. Extending this framework, we introduce Distributionally Robustifying DPO (Dr. DPO), which integrates pairwise robustness by optimizing against worst-case pairwise scenarios. The novel hyperparameter $\beta'$ in Dr. DPO allows for fine-tuned control over data pair reliability, providing a strategic balance between exploration and exploitation in noisy training environments. Empirical evaluations demonstrate that Dr. DPO substantially improves the quality of generated text and response accuracy in preference datasets, showcasing enhanced performance in both noisy and noise-free settings. The code is available at https://github.com/junkangwu/Dr_DPO.
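To make the role of $\beta'$ concrete, the sketch below shows one way per-pair DPO losses could be reweighted by pair reliability. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the Dr. DPO objective, so the soft-min aggregation $-\beta' \log \mathrm{mean}\,\exp(-\ell_i/\beta')$ used here is an assumed form, and all function names and margin values are hypothetical. The linked repository is the authoritative reference.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_pair_losses(margins, beta):
    """Standard per-pair DPO loss: -log sigmoid(beta * margin).

    Each margin is the chosen-minus-rejected difference of
    log(pi_theta / pi_ref) for one preference pair.
    """
    return [-log_sigmoid(beta * m) for m in margins]


def dr_dpo_aggregate(losses, beta_prime):
    """ASSUMED aggregation: -beta' * log(mean(exp(-loss / beta')))."""
    scaled = [-loss / beta_prime for loss in losses]
    top = max(scaled)  # subtract the max for a stable log-sum-exp
    lse = top + math.log(sum(math.exp(s - top) for s in scaled))
    return -beta_prime * (lse - math.log(len(losses)))


def pair_weights(losses, beta_prime):
    """Implied per-pair weights: softmax of -loss / beta'."""
    scaled = [-loss / beta_prime for loss in losses]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical margins; the third pair behaves like a flipped label.
margins = [2.0, 1.5, -3.0]
losses = dpo_pair_losses(margins, beta=1.0)
print("plain mean loss :", sum(losses) / len(losses))
print("aggregated loss :", dr_dpo_aggregate(losses, beta_prime=1.0))
print("pair weights    :", pair_weights(losses, beta_prime=1.0))
```

In this sketch a small $\beta'$ concentrates weight on low-loss pairs and discounts the pair that looks mislabeled, while a very large $\beta'$ recovers the plain mean of the DPO losses.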

Paper Structure (39 sections, 10 theorems, 67 equations, 12 figures, 9 tables)

Introduction
Preliminaries
Analyzing DPO's Pointwise Robustness
Pointwise Noise Impairs DPO Performance
Pointwise Robustness in Reward Modeling
Dr. DPO: Toward Pairwise Robustness
Pairwise Noise Impairs DPO Convergence and Performance
Distributionally Robustifying DPO
Why is Dr. DPO Robust to Pairwise Noise?
Experiments
How Well can Dr. DPO Resist the Pairwise Noise?
Comparing Dr. DPO with Baselines on MT-Bench
Ablation Studies on Dr. DPO
Discussion
Related Work
...and 24 more sections

Key Result

Theorem 3.1

Let the Kullback-Leibler (KL) divergence between policy $\pi_\theta$ and reference policy $\pi_{\text{ref}}$ be defined as: $\mathbb{D}_{\text{KL}}(\pi_\theta\|\pi_{\text{ref}}) = \int \pi_\theta(x) \log\left(\frac{\pi_\theta(x)}{\pi_{\text{ref}}(x)}\right) dx.$ Optimizing the RM-DRO objective as def… Here, $\alpha, \beta$ are Lagrange multipliers, $\beta^*(\eta)$ denotes the optimal value of $\beta$…
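The KL divergence in the theorem statement is written as an integral. As a minimal sketch, the discrete analogue $\sum_i p_i \log(p_i/q_i)$ can be computed as follows; the function name `kl_divergence` and the two example distributions are hypothetical and stand in for $\pi_\theta$ and $\pi_{\text{ref}}$ over a small finite set of responses.

```python
import math


def kl_divergence(p, q):
    """Discrete KL(p || q) = sum_i p_i * log(p_i / q_i).

    Terms with p_i == 0 contribute nothing; if p_i > 0 where q_i == 0,
    the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical distributions over two candidate responses.
pi_theta = [0.5, 0.5]
pi_ref = [0.25, 0.75]

print("KL(pi_theta || pi_ref):", kl_divergence(pi_theta, pi_ref))
print("KL(pi_ref || pi_theta):", kl_divergence(pi_ref, pi_theta))
```

The two directions generally differ, so the order of the arguments in $\mathbb{D}_{\text{KL}}(\pi_\theta\|\pi_{\text{ref}})$ matters.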

Figures (12)

Figure 1: Left: An example illustrating pointwise and pairwise noise. Right: Comparison of gradients between DPO and Dr. DPO under varying levels of pairwise noise.
Figure 2: Impact of pointwise noise on the expected reward frontier and KL divergence in DPO ($\beta=0.1$).
Figure 3: (a) Effect of pointwise noise on the expected reward frontier for different $\beta$ values on the IMDB dataset. (b) Effect of pointwise noise on the win rate for different $\beta$ values on the HH dataset. The stars ($\star$, shown in red, blue, and purple) indicate the optimal $\beta$ selection for the corresponding pointwise noise ratio.
Figure 4: Left: Impact of pairwise noise on the expected reward frontier and KL divergence in DPO ($\beta=0.1$). Right: Comparative analysis of the effect of pairwise noise on the expected reward frontier for different $\beta$ values.
Figure 5: MT-Bench evaluation of DPO and its variants using GPT-4. Panels 1-2 show Win, Tie, and Loss rates at 0% and 40% pairwise noise levels. Panel 3 shows Dr. DPO's win rate across different $\phi$-divergences, and panel 4 shows its preference accuracy for varying $\beta^{\prime}$ values.
...and 7 more figures

Theorems & Definitions (16)

Theorem 3.1: Optimal Reward Function under KL Divergence
Lemma 3.2
Theorem 4.1
Theorem 4.2: Upper Bound for Dr. DPO
Theorem B.1: Optimal Reward Function under KL Divergence
proof
Definition B.1: $\phi$-divergence [nguyen2010estimating]
Definition B.2: Convex conjugate [hiriart2004fundamentals]
Theorem B.3: Interchange of minimization and integration [ben2007old]
Theorem B.3
...and 6 more
