InCo-DPO: Balancing Distribution Shift and Data Quality for Enhanced Preference Optimization

Yunan Wang, Jijie Li, Bo-Wen Zhang, Liangdong Wang, Guang Liu

TL;DR

This work studies Direct Preference Optimization (DPO) for aligning LLMs and identifies data quality as a key factor alongside distribution shift. It introduces InCo-DPO, a continuation-based data synthesis method that combines on-policy and off-policy samples via a tunable prefix from a strong model, enabling a dynamic trade-off between reward and policy consistency. Empirical results on AlpacaEval 2.0 and Arena-Hard across multiple open-source models show that InCo-DPO outperforms both pure on-policy and pure off-policy data and achieves state-of-the-art win rates (e.g., 60.8 on Arena-Hard with Gemma-2). The approach advances data-efficient, stable preference optimization and applies broadly across datasets and model families, though future work should extend it to safety-focused evaluation and multi-turn alignment.

Abstract

Direct Preference Optimization (DPO) optimizes language models to align with human preferences. Using on-policy samples, which are generated directly by the policy model, typically yields better performance than using off-policy samples because their distribution is consistent with the model. This paper identifies the quality of candidate preference samples as another critical factor. While the quality of on-policy data is inherently constrained by the capabilities of the policy model, off-policy data, which can be derived from diverse sources, offers greater potential for quality despite suffering from distribution shift. However, current research mostly relies on on-policy data and, because of the challenge posed by distribution shift, neglects the value of off-policy data in terms of data quality. In this paper, we propose InCo-DPO, an efficient method for synthesizing preference data by integrating on-policy and off-policy data, allowing dynamic adjustments that balance distribution shift against data quality to find an optimal trade-off. Consequently, InCo-DPO overcomes both the distribution-shift limitation of off-policy data and the quality constraint of on-policy data. We evaluated InCo-DPO on the AlpacaEval 2.0 and Arena-Hard benchmarks. Experimental results demonstrate that our approach not only outperforms both on-policy and off-policy data but also achieves a state-of-the-art win rate of 60.8 on Arena-Hard with vanilla DPO using the Gemma-2 model.

Paper Structure

This paper contains 20 sections, 4 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Comparison of reward and evaluation results on AlpacaEval2 between on-policy data and off-policy data. We use Mistral-Base as its minimal post-training reduces distribution shift problems, better revealing the relation between reward and performance.
  • Figure 2: Relationship between rewards of partial and final responses with a correlation coefficient of 0.81. The red line represents the fitted linear regression line, indicating a significant positive relationship.
  • Figure 3: Different data synthesis methods. On-policy data is completely sampled from the policy model. Continuation-based sampling (InCo-DPO): the policy model directly continues and completes partial outputs as prefixes from a stronger model under a lower temperature. Rewriting-based sampling: we prompt the policy model to rewrite completed responses from a stronger model.
  • Figure 4: The impact of prefix token number on consistency weight and reward. When the Prefix Token Number is less than 10, the consistency weight decreases only slightly while the reward value of the response increases significantly, resulting in an improvement in the final training performance.
  • Figure 5: The impact of prefix token number on final performance, where the red line represents a Gaussian smoothing curve with a coefficient of 1.0.
  • ...and 1 more figures
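
The continuation-based sampling described in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_continuation_sample`, the `policy_continue` callable, the `toy_policy` stub, and the default temperature value are all hypothetical and stand in for a real policy model. The sketch only shows the idea that the first `prefix_len` tokens come from a stronger model and the policy model completes the rest, so `prefix_len = 0` reduces to pure on-policy sampling and a prefix covering the whole response reduces to off-policy data.

```python
from typing import Callable, List

# Hypothetical signature: (prompt, prefix_tokens, temperature) -> continuation tokens.
PolicyFn = Callable[[str, List[str], float], List[str]]


def build_continuation_sample(
    prompt: str,
    strong_tokens: List[str],
    prefix_len: int,
    policy_continue: PolicyFn,
    temperature: float = 0.5,  # assumed value; the caption only says "lower temperature"
) -> List[str]:
    """Keep the first `prefix_len` tokens of the stronger model's response
    and let the policy model complete the remainder."""
    if prefix_len < 0:
        raise ValueError("prefix_len must be non-negative")
    prefix = strong_tokens[:prefix_len]
    continuation = policy_continue(prompt, prefix, temperature)
    return prefix + continuation


def toy_policy(prompt: str, prefix: List[str], temperature: float) -> List[str]:
    """Stand-in for a real policy model: returns a fixed continuation."""
    return ["<policy>", "continuation"]


if __name__ == "__main__":
    strong = ["Paris", "is", "the", "capital", "of", "France", "."]
    for k in (0, 3, len(strong)):
        print(k, build_continuation_sample("Capital of France?", strong, k, toy_policy))
```

In a real pipeline `policy_continue` would wrap the policy model's generation call, and `prefix_len` is the knob that Figures 4 and 5 vary when studying consistency weight, reward, and final performance.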