Lightweight Robust Direct Preference Optimization

Cheol Woo Kim, Shresth Verma, Mauricio Tec, Milind Tambe

TL;DR

This paper addresses the sensitivity of Direct Preference Optimization (DPO) to noisy preference signals by introducing DPO-PRO, a lightweight distributionally robust optimization (DRO) approach that is robust only with respect to the preference distribution. DPO-PRO models uncertainty in the preference distribution with a chi-squared-divergence ambiguity set, derives a closed-form worst-case probability, and shows that the resulting DRO loss is equivalent to a regularized DPO loss that penalizes model overconfidence under weak preference signals. The method adds negligible computational overhead and improves robustness on standard alignment benchmarks (e.g., UltraFeedback) and a real-world public health reward-design task, outperforming vanilla DPO and prior DRO variants, especially under higher noise. Overall, DPO-PRO offers a principled, scalable way to stabilize preference-based fine-tuning by concentrating robustness on the noisy preference signal while preserving learning speed and effectiveness.

Abstract

Direct Preference Optimization (DPO) has become a popular method for fine-tuning large language models (LLMs) due to its stability and simplicity. However, it is also known to be sensitive to noise in the data and prone to overfitting. Recent works have proposed using distributionally robust optimization (DRO) to address potential noise and distributional shift in the data. However, these methods often suffer from excessive conservatism and high computational cost. We propose DPO-PRO (DPO with Preference Robustness), a robust fine-tuning algorithm based on DPO which accounts for uncertainty in the preference distribution through a lightweight DRO formulation. Unlike prior DRO-based variants, DPO-PRO focuses solely on uncertainty in preferences, avoiding unnecessary conservatism and incurring negligible computational overhead. We further show that DPO-PRO is equivalent to a regularized DPO objective that penalizes model overconfidence under weak preference signals. We evaluate DPO-PRO on standard alignment benchmarks and a real-world public health task. Experimental results show that our method consistently improves robustness to noisy preference signals compared to existing DPO variants.
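
The closed-form worst-case probability mentioned in the TL;DR can be motivated with a short derivation. This is a sketch under stated assumptions, not the paper's exact statement: it assumes a binary preference with nominal probability $q$ that the first response is preferred, the Pearson chi-squared divergence without a $1/2$ factor (which matches the $\sqrt{\rho\, q(1 - q)}$ term in the Figure 1 caption), and hypothetical per-label losses $\ell_1$ and $\ell_2$ that are not notation from the source.

```latex
% Chi-squared divergence between Bernoulli(p) and Bernoulli(q):
\begin{align}
\chi^2(p \,\|\, q)
  &= \frac{(p - q)^2}{q} + \frac{(p - q)^2}{1 - q}
   = \frac{(p - q)^2}{q(1 - q)} \le \rho
  \;\Longleftrightarrow\;
  |p - q| \le \sqrt{\rho\, q(1 - q)} . \\
% The robust objective p * l_1 + (1 - p) * l_2 is linear in p, so its maximum
% over the interval above (intersected with [0, 1]) sits at an endpoint:
\hat{p}
  &= \begin{cases}
       \min\{1,\; q + \sqrt{\rho\, q(1 - q)}\} & \text{if } \ell_1 \ge \ell_2, \\
       \max\{0,\; q - \sqrt{\rho\, q(1 - q)}\} & \text{otherwise.}
     \end{cases}
\end{align}
```

Because the ambiguity set is a one-dimensional interval, the inner maximization needs no iterative solver, which is consistent with the negligible overhead the abstract reports.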

Paper Structure

This paper contains 42 sections, 2 theorems, 20 equations, 3 figures, 5 tables.

Key Result

Proposition 4.1

Eq. (eq:grad_est) provides an unbiased gradient estimate of the DPO-PRO loss in Eq. (eq:dro_loss).

Figures (3)

  • Figure 1: Visualization of the uncertainty-weighted coefficient for various values of $\rho$. This coefficient attains its maximum near $q = 0.5$ and decreases as $q \to 0$ or $q \to 1$. For small values of $\rho$, the maximum is achieved exactly at $q = 0.5$, where $\sqrt{\rho\, q(1 - q)}$ peaks. However, for larger $\rho$, the maximum occurs at the intersection point between the curves $1 - q$ and $\sqrt{\rho\, q(1 - q)}$. This is because $q + \sqrt{\rho\, q(1 - q)}$ (the worst-case distribution under the chi-squared divergence constraint) may exceed 1 when $\rho$ is large, causing the worst-case distribution $\hat{p}$ to be clipped at 1. In such cases, the adversary’s perturbation $\hat{p} - q$ can be greater when $q < 0.5$, leading the penalty term to peak at a value of $q$ smaller than 0.5.
  • Figure 2: Prompt passed to the LLM to generate a reward function based on the context of the problem scenario in the Real World Domain.
  • Figure 3: Prompt passed to the LLM to choose a reward function, given the context of the problem scenario in the Real World Domain, the generated reward functions, and the reward distribution corresponding to each reward function.
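
The behavior described in the Figure 1 caption can be checked numerically. The following is a minimal sketch, not the authors' code: it takes the clipped worst-case probability $\hat{p} = \min(1,\, q + \sqrt{\rho\, q(1 - q)})$ and the perturbation $\hat{p} - q$ directly from the caption, while the function names and the grid search are illustrative assumptions. Setting $1 - q = \sqrt{\rho\, q(1 - q)}$ gives the intersection point $q = 1/(1 + \rho)$, which lies below $0.5$ exactly when $\rho > 1$; this threshold is derived here from the caption's formulas rather than stated in the source.

```python
import math


def worst_case_prob(q: float, rho: float) -> float:
    """Clipped upper worst-case probability from the Figure 1 caption.

    Hypothetical helper name; the formula min(1, q + sqrt(rho*q*(1-q)))
    is taken from the caption.
    """
    return min(1.0, q + math.sqrt(rho * q * (1.0 - q)))


def uncertainty_coefficient(q: float, rho: float) -> float:
    """Adversary's perturbation p_hat - q, i.e. min(1 - q, sqrt(rho*q*(1-q)))."""
    return worst_case_prob(q, rho) - q


def argmax_on_grid(rho: float, n: int = 10001) -> float:
    """Return the grid point in [0, 1] where the coefficient is largest."""
    grid = [i / (n - 1) for i in range(n)]
    return max(grid, key=lambda q: uncertainty_coefficient(q, rho))


if __name__ == "__main__":
    for rho in (0.25, 0.5, 1.0, 2.0, 4.0):
        q_star = argmax_on_grid(rho)
        # Predicted peak: 0.5 without clipping, 1/(1+rho) once clipping binds.
        predicted = 0.5 if rho <= 1.0 else 1.0 / (1.0 + rho)
        print(f"rho={rho}: grid argmax={q_star:.4f}, predicted={predicted:.4f}")
```

Running the script compares the grid maximizer with the predicted peak for several values of $\rho$; for $\rho > 1$ the peak moves to the left of $0.5$, as the caption describes.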

Theorems & Definitions (2)

  • Proposition 4.1
  • Proposition 4.2