Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization

Yuxin Jiang, Bo Huang, Yufei Wang, Xingshan Zeng, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Wei Wang

TL;DR

This paper addresses limitations in direct preference optimization (DPO) caused by weak correlations between independently generated winning and losing responses. It introduces BMC, a two-phase framework comprising a Bridging Phase (data synthesis that produces a pseudo-winning response by revising the losing response with the winning response as a reference) and a Modeling Phase (token-level reward weighting guided by the policy model's confidence) to better capture correlations and fine-grained distinctions. In experiments on QA, math, and instruction-following tasks, BMC consistently outperforms strong offline baselines, and ablations confirm that both phases and the adaptive token weighting are needed. The approach applies across DPO variants and model sizes, offering practical gains with modest computational overhead and broad applicability to future preference-learning pipelines.

Abstract

Direct preference optimization (DPO), a widely adopted offline preference optimization algorithm, aims to align large language models (LLMs) with human-desired behaviors using pairwise preference data. However, the winning response and the losing response within pairwise data are typically generated in isolation, leading to weak correlations between them as well as suboptimal alignment performance. To address this issue, we propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. Firstly, we increase the consistency and informativeness of the pairwise preference signals through targeted modifications, synthesizing a pseudo-winning response by improving the losing response with the winning response as a reference. Secondly, we identify that DPO alone is insufficient to model these correlations and capture nuanced variations. Therefore, we propose learning token-level correlations by dynamically leveraging the policy model's confidence during training. Comprehensive experiments on QA, math, and instruction-following tasks demonstrate the effectiveness of our approach, significantly surpassing competitive baselines, including DPO. Additionally, our in-depth quantitative analysis reveals the reasons behind our method's superior performance over DPO and showcases its versatility across other DPO variants. We release our repository at https://github.com/YJiangcm/BMC.
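
For readers unfamiliar with the baseline the abstract builds on, the following is a minimal sketch of the standard sequence-level DPO loss for a single preference pair. It is not taken from the BMC repository: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for illustration. The inputs are summed log-probabilities of each response under the policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winning, losing) pair.

    All arguments except beta are sequence-level log-probabilities.
    The names and the beta default are illustrative assumptions.
    """
    # Implicit rewards: beta times the policy-to-reference log-ratio.
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    margin = reward_w - reward_l
    # -log(sigmoid(margin)) == softplus(-margin), computed without overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Raising the winner's likelihood and lowering the loser's reduces the loss.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

Because this loss only sees one scalar per response, it treats every token alike, which is the gap the token-level Modeling Phase described in the abstract is meant to address.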

Paper Structure

This paper contains 42 sections, 7 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Overview of our proposed BMC framework. (1) In the Bridging Phase, we utilize an off-the-shelf LLM to make targeted modifications of losing response $y_l$ on undesired tokens, with the winning response $y_w$ serving as a reference. Therefore, the synthesized pseudo-winning response $\Tilde{y}_w$ is highly correlated with $y_l$. (2) In the Modeling Phase, we model the correlations between $\Tilde{y}_w$ and $y_l$ by dynamically emphasizing the rewards of their varied tokens ($\mathit{diff}(\Tilde{y}_w \mid y_l)$ and $\mathit{diff}(y_l \mid \Tilde{y}_w)$), leveraging the policy model's confidence (numbers indicated above tokens) during training.
  • Figure 3: Ablation study on diverse data synthesis methods in the Bridging Phase. The average accuracy is presented for QA and Math. LC on AlpacaEval 2 is reported for instruction following (IF), based on Llama3-8B.
  • Figure 4: Influence of diverse LLMs for targeted modification in the Bridging Phase. The average accuracy is presented for QA and Math. LC on AlpacaEval 2 is reported for instruction following (IF), based on Llama3-8B.
  • Figure 5: We segment the 60k training data of UltraFeedback into six equal-sized splits based on increasing edit distance between winning and losing responses. For each split, we report LC on AlpacaEval 2 and the average gradient norm during training.
  • Figure 5: Versatility of our framework across various XPOs.
  • ...and 4 more figures
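
The token-level idea in the Figure 1 caption (finding the tokens that differ between the pseudo-winning response and the losing response, then emphasizing them according to policy confidence) can be sketched roughly as follows. This is an illustrative sketch, not the paper's exact formulation: the helper names `diff_mask` and `token_weights`, the weighting rule `1 + min(confidence, delta)`, and the cap `delta` are all assumptions, and the alignment here uses Python's standard `difflib` rather than whatever the authors use.

```python
from difflib import SequenceMatcher


def diff_mask(target, other):
    """Return a list of booleans, True where a token of `target`
    does not fall inside any block that matches `other`."""
    mask = [True] * len(target)
    matcher = SequenceMatcher(a=other, b=target, autojunk=False)
    for block in matcher.get_matching_blocks():
        # block.b is the start index in `target`, block.size the match length.
        for j in range(block.b, block.b + block.size):
            mask[j] = False
    return mask


def token_weights(mask, confidences, delta=0.5):
    """Hypothetical weighting: shared tokens keep weight 1, differing
    tokens get 1 + min(confidence, delta). Assumed form, for illustration."""
    return [1.0 + min(c, delta) if m else 1.0 for m, c in zip(mask, confidences)]


y_l = ["2", "+", "3", "=", "6"]          # losing response (toy tokens)
y_w_tilde = ["2", "+", "3", "=", "5"]    # pseudo-winning response (toy tokens)

mask = diff_mask(y_w_tilde, y_l)
print(mask)  # only the final, corrected token is marked as differing

# Made-up per-token policy probabilities, standing in for model confidence.
print(token_weights(mask, [0.9, 0.9, 0.9, 0.9, 0.3]))
```

In a real training loop the confidences would come from the policy model's per-token probabilities, treated as constants with respect to the gradient, and the resulting weights would scale the per-token log-ratio terms before they are summed into the DPO margin.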