$β$-DPO: Direct Preference Optimization with Dynamic $β$
Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He
TL;DR
This work analyzes how the DPO objective for aligning LLMs with human preferences depends on the trade-off parameter $\beta$ and on the quality of the pairwise preference data. It introduces $\beta$-DPO, a framework that dynamically calibrates $\beta$ at the batch level and applies $\beta$-guided data filtering to mitigate the influence of outliers. Empirical results on dialogue and summarization tasks show significant performance gains over static-$\beta$ DPO and other baselines, and these gains hold across model sizes and data conditions. The approach is lightweight, model-agnostic, and readily integrated into existing DPO workflows, offering a practical path to more reliable human-aligned LLMs.
Abstract
Direct Preference Optimization (DPO) has emerged as a compelling approach for training Large Language Models (LLMs) to adhere to human preferences. However, the performance of DPO is sensitive to the choice of its trade-off parameter $β$, as well as to the quality of the preference data. We analyze the impact of $β$ and data quality on DPO and find that the optimal $β$ varies with the informativeness of the pairwise data. To address the limitations of a static $β$, we introduce a framework that dynamically calibrates $β$ at the batch level, informed by data quality considerations. Additionally, our method incorporates $β$-guided data filtering to safeguard against the influence of outliers. Through empirical evaluation, we demonstrate that our dynamic $β$ adjustment technique significantly improves DPO's performance across a range of models and datasets, offering a more robust and adaptable training paradigm for aligning LLMs with human feedback. The code is available at \url{https://github.com/junkangwu/beta-DPO}.
