Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization

Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jiawei Chen, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He

TL;DR

The paper analyzes noise in Direct Preference Optimization (DPO) for aligning LLMs and frames robustness within Distributionally Robust Optimization (DRO). It shows that DPO implicitly implements pointwise DRO and derives the relationship between the regularization parameter $\beta$ and the DRO radius $\eta$, namely $\beta^*(\eta)=\sqrt{\mathrm{Var}_{\pi_{\text{ref}}}[r(x,y)]/(2\eta)}$, interpreting $\beta$ as a noise reflector. To address pairwise noise, it introduces Distributionally Robustifying DPO (Dr. DPO) with a new hyperparameter $\beta'$ that governs worst-case pairwise weighting, achieving robust performance with a minimal code change. Empirically, Dr. DPO improves text quality and response accuracy in both noisy and clean settings on IMDB, Anthropic HH, and MT-Bench, outperforming DPO and several baselines. The work provides practical guidance on tuning $\beta'$ and demonstrates how pairwise robustness can be integrated into existing DPO workflows for more reliable LLM alignment on noisy real-world data.
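As a worked illustration of the $\beta^*(\eta)$ relation stated above, the following minimal sketch evaluates the formula on a handful of reward values. The reward list and the $\eta$ values are hypothetical, and treating the samples as equally weighted draws from $\pi_{\text{ref}}$ is an assumption made only for this example.

```python
import math
from statistics import pvariance


def optimal_beta(rewards, eta):
    """Evaluate beta*(eta) = sqrt(Var[r] / (2 * eta)).

    `rewards` are treated as equally weighted samples standing in for
    rewards under the reference policy (an assumption of this sketch).
    """
    if eta <= 0:
        raise ValueError("eta must be positive")
    return math.sqrt(pvariance(rewards) / (2.0 * eta))


# Hypothetical reward samples, chosen only to make the arithmetic easy.
rewards = [0.0, 1.0, 2.0, 3.0]

for eta in (0.1, 0.5, 2.0):
    print(f"eta={eta}: beta*={optimal_beta(rewards, eta):.4f}")
```

Under this formula a larger radius $\eta$ yields a smaller $\beta^*$, and a larger reward variance yields a larger one.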

Abstract

This study addresses the challenge of noise in training datasets for Direct Preference Optimization (DPO), a method for aligning Large Language Models (LLMs) with human preferences. We categorize noise into pointwise noise, which includes low-quality data points, and pairwise noise, which encompasses erroneous data pair associations that affect preference rankings. Utilizing Distributionally Robust Optimization (DRO), we enhance DPO's resilience to these types of noise. Our theoretical insights reveal that DPO inherently embeds DRO principles, conferring robustness to pointwise noise, with the regularization coefficient $\beta$ playing a critical role in its noise resistance. Extending this framework, we introduce Distributionally Robustifying DPO (Dr. DPO), which integrates pairwise robustness by optimizing against worst-case pairwise scenarios. The novel hyperparameter $\beta'$ in Dr. DPO allows for fine-tuned control over data pair reliability, providing a strategic balance between exploration and exploitation in noisy training environments. Empirical evaluations demonstrate that Dr. DPO substantially improves the quality of generated text and response accuracy in preference datasets, showcasing enhanced performance in both noisy and noise-free settings. The code is available at https://github.com/junkangwu/Dr_DPO.
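To make the pairwise reweighting idea concrete, here is a minimal sketch in plain Python. It assumes that Dr. DPO replaces the batch mean of per-pair DPO losses with a log-mean-exp aggregation controlled by $\beta'$; that form is not spelled out on this page, so treat it as an assumption and consult the linked repository for the authors' implementation. The function names and the margin values are hypothetical.

```python
import math


def dpo_pair_losses(policy_margins, ref_margins, beta):
    """Per-pair DPO loss: -log(sigmoid(beta * (policy margin - ref margin))).

    A margin is log p(chosen) - log p(rejected) under the given model.
    """
    losses = []
    for pm, rm in zip(policy_margins, ref_margins):
        z = beta * (pm - rm)
        # -log(sigmoid(z)) == softplus(-z), computed in a numerically stable way.
        losses.append(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))
    return losses


def dr_dpo_loss(pair_losses, beta_prime):
    """Assumed aggregation: -beta' * log(mean(exp(-loss_i / beta')))."""
    scaled = [-loss / beta_prime for loss in pair_losses]
    m = max(scaled)  # subtract the max before exponentiating for stability
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return -beta_prime * log_mean_exp


# Hypothetical log-probability margins; the third pair looks mislabeled.
losses = dpo_pair_losses([2.0, 1.0, -3.0], [0.0, 0.0, 0.0], beta=0.1)
print("mean DPO loss:", sum(losses) / len(losses))
print("aggregated loss (beta'=1.0):", dr_dpo_loss(losses, beta_prime=1.0))
```

In this form each pair's gradient weight is proportional to $\exp(-\ell_i/\beta')$, so pairs with a high loss contribute less, and the aggregate approaches the plain mean as $\beta'$ grows.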

Paper Structure (39 sections, 10 theorems, 67 equations, 12 figures, 9 tables)

This paper contains 39 sections, 10 theorems, 67 equations, 12 figures, 9 tables.

Key Result

Theorem 3.1

Let the Kullback-Leibler (KL) divergence between the policy $\pi_\theta$ and the reference policy $\pi_{\text{ref}}$ be defined as: $\mathbb{D}_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) = \int \pi_\theta(x) \log\left(\frac{\pi_\theta(x)}{\pi_{\text{ref}}(x)}\right) dx.$ Optimizing the RM-DRO objective as def... Here, $\alpha, \beta$ are Lagrange multipliers, $\beta^*(\eta)$ denotes the optimal value of $\beta$...
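For readers who want a concrete handle on the KL term in this theorem, the sketch below computes the discrete analogue of the definition, with a sum in place of the integral. The two example distributions are hypothetical and stand in for $\pi_\theta$ and $\pi_{\text{ref}}$ over a two-element outcome space.

```python
import math


def kl_divergence(p, q):
    """Discrete D_KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical policy and reference distributions.
policy = [0.5, 0.5]
reference = [0.9, 0.1]
print(kl_divergence(policy, reference))
print(kl_divergence(policy, policy))
```

The divergence is zero only when the two distributions coincide, and it is not symmetric in its arguments.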

Figures (12)

  • Figure 1: Left: An example illustrating pointwise and pairwise noise. Right: Comparison of gradients between DPO and Dr. DPO under varying levels of pairwise noise.
  • Figure 2: Impact of pointwise noise on the expected reward frontier and KL divergence in DPO ($\beta=0.1$).
  • Figure 3: (a) Comparative analysis of the effect of pointwise noise on the expected reward frontier for different $\beta$ values on the IMDB dataset. (b) Comparative analysis of the effect of pointwise noise on the win rate for different $\beta$ values on the HH dataset. The red, blue, and purple stars ($\star$) indicate the optimal $\beta$ selection for the corresponding pointwise noise ratio.
  • Figure 4: Left: Impact of pairwise noise on the expected reward frontier and KL divergence in DPO ($\beta=0.1$). Right: Comparative analysis of the effect of pairwise noise on the expected reward frontier for different $\beta$ values.
  • Figure 5: MT-Bench evaluates DPO and its variants using GPT-4, showing Win, Tie, and Loss rates at 0% and 40% pairwise noise levels in Figures 1-2. Figure 3 illustrates Dr. DPO's win rate across different $\phi$-divergences, while Figure 4 presents its preference accuracy for varying $\beta^{\prime}$ values.
  • ...and 7 more figures

Theorems & Definitions (16)

  • Theorem 3.1: Optimal Reward Function under KL Divergence
  • Lemma 3.2
  • Theorem 4.1
  • Theorem 4.2: Upper Bound for Dr. DPO
  • Theorem B.1: Optimal Reward Function under KL Divergence
  • proof
  • Definition B.1: $\phi$-divergence (nguyen2010estimating)
  • Definition B.2: Convex conjugate (hiriart2004fundamentals)
  • Theorem B.3: Interchange of minimization and integration (ben2007old)
  • Theorem B.3
  • ...and 6 more