InCo-DPO: Balancing Distribution Shift and Data Quality for Enhanced Preference Optimization
Yunan Wang, Jijie Li, Bo-Wen Zhang, Liangdong Wang, Guang Liu
TL;DR
This work studies Direct Preference Optimization (DPO) for aligning LLMs and highlights data quality as a key factor alongside distribution shift. It introduces InCo-DPO, a continuation-based data synthesis method that combines on-policy and off-policy samples via a tunable prefix from a strong model, enabling a dynamic trade-off between reward and policy consistency. Empirical results on AlpacaEval 2.0 and Arena-Hard across multiple open-source models show that InCo-DPO outperforms training on purely on-policy or purely off-policy data and achieves state-of-the-art win rates (e.g., 60.8 on Arena-Hard with Gemma-2). The approach advances data-efficient, stable preference optimization and applies broadly across datasets and model families, though future work should extend it to safety-focused evaluation and multi-turn alignment.
Abstract
Direct Preference Optimization (DPO) optimizes language models to align with human preferences. Training on on-policy samples, which are generated directly by the policy model, typically yields better performance than training on off-policy samples because their distribution is consistent with the model. This paper identifies the quality of candidate preference samples as another critical factor. The quality of on-policy data is inherently constrained by the capabilities of the policy model, whereas off-policy data, which can be drawn from diverse sources, offers greater potential for quality but suffers from distribution shift. However, current research mostly relies on on-policy data and, because of the challenge posed by distribution shift, neglects the value of off-policy data in terms of data quality. In this paper, we propose InCo-DPO, an efficient method for synthesizing preference data that integrates on-policy and off-policy data and allows the balance between distribution shift and data quality to be adjusted dynamically, so that an optimal trade-off can be found. Consequently, InCo-DPO overcomes both the distribution shift of off-policy data and the quality constraints of on-policy data. We evaluate InCo-DPO on the AlpacaEval 2.0 and Arena-Hard benchmarks. Experimental results demonstrate that our approach not only outperforms training on purely on-policy or purely off-policy data but also achieves a state-of-the-art win rate of 60.8 on Arena-Hard with vanilla DPO using the Gemma-2 model.
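To make the idea of continuation-based synthesis concrete, the following is a minimal sketch, not the authors' implementation. It assumes one plausible reading of the TL;DR: a prefix of tunable length is taken from a strong model's (off-policy) response, and the policy model writes the rest. The names `synthesize_candidate`, `select_pair`, `policy_continue`, and `reward_fn` are hypothetical, and choosing the highest- and lowest-reward candidates as the preference pair is an assumption for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = List[str]


def synthesize_candidate(
    prompt: Sequence[str],
    off_policy_response: Sequence[str],
    prefix_len: int,
    policy_continue: Callable[[Sequence[str], Sequence[str]], Tokens],
) -> Tokens:
    """Build one candidate: strong-model prefix + policy-model continuation.

    prefix_len = 0 gives a purely on-policy sample; a prefix covering the
    whole off-policy response approaches a purely off-policy sample.
    `policy_continue` is a hypothetical stand-in for sampling from the policy.
    """
    if prefix_len < 0:
        raise ValueError("prefix_len must be non-negative")
    prefix = list(off_policy_response[:prefix_len])
    continuation = policy_continue(prompt, prefix)
    return prefix + list(continuation)


def select_pair(
    candidates: Sequence[Tokens],
    reward_fn: Callable[[Tokens], float],
) -> Tuple[Tokens, Tokens]:
    """Assumed pairing rule: highest reward is chosen, lowest is rejected."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    chosen = max(candidates, key=reward_fn)
    rejected = min(candidates, key=reward_fn)
    return chosen, rejected
```

In this sketch, `prefix_len` is the single knob that moves a sample between the on-policy and off-policy extremes; how the prefix length is actually chosen or scheduled is not specified in the abstract and is left open here.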
