
$\beta$-DPO: Direct Preference Optimization with Dynamic $\beta$

Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He

TL;DR

This work analyzes how the DPO objective for aligning LLMs with human preferences depends on the trade-off parameter $\beta$ and the quality of pairwise data. It introduces $\beta$-DPO, a simple yet effective framework that dynamically calibrates $\beta$ at the batch level and applies $\beta$-guided data filtering to mitigate outliers. Empirical results across dialogue and summarization tasks show significant performance gains over static-$\beta$ DPO and other baselines, with demonstrated robustness across model sizes and data conditions. The approach is lightweight, model-agnostic, and readily integrable with existing DPO workflows, offering a practical path to more reliable human-aligned LLMs.

Abstract

Direct Preference Optimization (DPO) has emerged as a compelling approach for training Large Language Models (LLMs) to adhere to human preferences. However, the performance of DPO is sensitive to the fine-tuning of its trade-off parameter $\beta$, as well as to the quality of the preference data. We analyze the impact of $\beta$ and data quality on DPO, uncovering that optimal $\beta$ values vary with the informativeness of pairwise data. Addressing the limitations of static $\beta$ values, we introduce a novel framework that dynamically calibrates $\beta$ at the batch level, informed by data quality considerations. Additionally, our method incorporates $\beta$-guided data filtering to safeguard against the influence of outliers. Through empirical evaluation, we demonstrate that our dynamic $\beta$ adjustment technique significantly improves DPO's performance across a range of models and datasets, offering a more robust and adaptable training paradigm for aligning LLMs with human feedback. The code is available at https://github.com/junkangwu/beta-DPO.
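As a rough illustration of batch-level calibration, the sketch below computes the standard DPO loss and scales $\beta$ by how far the batch's mean implicit reward gap sits from a threshold. The linear rule in `batch_beta`, the names `alpha`, `m0`, and `min_beta`, and all numeric values are assumptions made for this example; this is not the authors' implementation, and the $\beta$-guided filtering step is omitted.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def margins(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Policy-vs-reference log-ratio difference between chosen (w) and rejected (l)."""
    return [
        (pw - rw) - (pl - rl)
        for pw, pl, rw, rl in zip(logp_w, logp_l, ref_logp_w, ref_logp_l)
    ]


def reward_gaps(logp_w, logp_l, ref_logp_w, ref_logp_l, beta0):
    """Per-pair implicit reward discrepancy r(y_w; x) - r(y_l; x), using beta0."""
    return [beta0 * m for m in margins(logp_w, logp_l, ref_logp_w, ref_logp_l)]


def batch_beta(gaps, beta0, alpha, m0, min_beta=1e-3):
    """Hypothetical linear rule (an assumption): raise beta when the batch's
    mean gap exceeds the threshold m0, lower it otherwise, with a floor."""
    mean_gap = sum(gaps) / len(gaps)
    return max(min_beta, (1.0 + alpha * (mean_gap - m0)) * beta0)


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Standard DPO loss: mean of -log sigmoid(beta * margin) over the batch."""
    ms = margins(logp_w, logp_l, ref_logp_w, ref_logp_l)
    return -sum(log_sigmoid(beta * m) for m in ms) / len(ms)


# Toy sequence log-probabilities for a batch of two preference pairs.
logp_w, logp_l = [-1.0, -2.0], [-1.5, -4.0]
ref_w, ref_l = [-1.2, -2.5], [-1.4, -3.0]

gaps = reward_gaps(logp_w, logp_l, ref_w, ref_l, beta0=0.1)
beta = batch_beta(gaps, beta0=0.1, alpha=0.6, m0=0.0)
loss = dpo_loss(logp_w, logp_l, ref_w, ref_l, beta)
```

The direction of the adjustment (larger gap, larger $\beta$) follows the trend reported for Figure 1, where a higher $\beta$ helps on high-gap pairs and hurts on low-gap pairs. How the threshold is maintained during training is left open here.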

Paper Structure (22 sections, 12 equations, 7 figures, 5 tables, 1 algorithm)

Figures (7)

  • Figure 1: (a) Pairwise Data: Low vs. High Gap: "Low gap" denotes cases where the chosen and rejected examples are closely similar, typically indicating high-quality, informative pairs. "High gap" signifies pairs with larger differences, implying lower-quality data. (b) Influence of Data Quality on $\beta$ Selection: Pythia-1.4B's performance on the HH dataset reveals a distinct trend: for "Low gap", a higher $\beta$ reduces win rate, whereas for "High gap", an increased $\beta$ improves it.
  • Figure 2: Win rate performance of DPO across different $\beta$ settings on the low gap, mixed gap, and high gap datasets.
  • Figure 3: The distribution of individual reward discrepancy ($r(\mathbf{y}_w^{(i)};\mathbf{x}^{(i)})-r(\mathbf{y}_l^{(i)};\mathbf{x}^{(i)})$) on the training dataset of HH.
  • Figure 4: Left: Win rates computed by GPT-4 evaluations for the Anthropic-HH one-step dialogue; $\beta$-DPO consistently outperforms across all sampling temperatures. Right: TL;DR summarization win rates versus chosen summaries, with GPT-4 as the evaluator; $\beta$-DPO is the only strategy achieving a win rate over 50% across different sampling temperatures.
  • Figure 5: Left: Win rates from GPT-4 evaluations on Anthropic-HH single-turn dialogues, showcasing $\beta$-DPO's adaptability to diverse filtering strategies. Middle: Win rates of $\beta$-DPO across various DPO variants as evaluated by GPT-4. Right: Distribution of individual reward discrepancies following fine-tuning through batch-level and instance-level calibration.
  • ...and 2 more figures