Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model

Junshu Pan, Wei Shen, Shulin Huang, Qiji Zhou, Yue Zhang

TL;DR

Pre-DPO tackles inefficiencies in offline preference optimization by replacing the conventional fixed reference with a guiding reference model derived from an initial optimization pass. The foresight provided by the guiding reference adaptively reweights the training data, enabling more effective policy improvement on top of DPO or SimPO and yielding consistent gains on AlpacaEval 2.0 and Arena-Hard v0.1 without external data. The approach is validated across multiple model families and settings, and is shown to be compatible with iterative preference optimization, to improve data utilization, and to maintain reasonable response lengths. Overall, the work introduces a practical, plug-in enhancement to RLHF pipelines that improves data efficiency and performance by transforming the reference from a constraint into an informed guide.

Abstract

Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly optimizing human preferences without an explicit reward model. We find that during DPO training, the reference model plays the role of a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the lack of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and necessitates stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that enhances preference optimization performance by leveraging a guiding reference model. This reference model provides foresight into the optimal policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.
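The claim that the reference model acts as a data weight adjuster can be illustrated with the standard DPO objective, whose per-sample gradient is scaled by a sigmoid of the negative implicit-reward margin. The following is a minimal sketch under that assumption, not the authors' implementation: the function name, argument names, and the log-probability values are hypothetical, and the inputs stand for summed sequence log-probabilities of the chosen and rejected responses.

```python
import math


def _sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _softplus(z: float) -> float:
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss_and_weight(policy_logp_chosen: float,
                        policy_logp_rejected: float,
                        ref_logp_chosen: float,
                        ref_logp_rejected: float,
                        beta: float = 0.1):
    """Return (loss, weight) for one preference pair under standard DPO.

    weight is the scalar that multiplies the per-sample gradient; it is
    assumed here to correspond to the quantity the figures call lambda.
    """
    # Implicit rewards: beta times the policy-to-reference log-ratio.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward

    loss = _softplus(-margin)   # equals -log(sigmoid(margin))
    weight = _sigmoid(-margin)  # gradient scale for this sample
    return loss, weight


# Policy and reference initialized identically: every pair starts with
# the same weight, regardless of its content.
print(dpo_loss_and_weight(-10.0, -12.0, -10.0, -12.0))

# Hypothetical guiding reference that already prefers the chosen response:
# the same initial policy now receives a larger weight on this pair.
print(dpo_loss_and_weight(-10.0, -10.0, -8.0, -12.0))
```

When the policy and reference are identical, the margin is zero and the weight is exactly 0.5 for every pair, which is one way to see why identical initialization gives no sample-specific signal at the start of training. Swapping in a different reference changes the weight per pair, which is the lever Pre-DPO uses.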

Paper Structure

This paper contains 34 sections, 16 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Pre-DPO introduces a guiding reference model derived from the optimized policy to guide re-optimization, transforming the reference from a constraint into an informed guide with foresight.
  • Figure 2: An overview of Pre-DPO. DPO constrains training using the initial policy model as the reference, while SimPO is reference-free. Pre-DPO first optimizes a policy model using DPO or SimPO, then resets it as a guiding reference model, and re-optimizes the initial policy using DPO. This process enhances data utilization and results in a better-optimized policy model.
  • Figure 3: $\lambda$ distribution dynamics of DPO, TR-DPO, and Pre-DPO under the Llama3.2-3B-Base setting. Pre-DPO maintains a broader distribution during the entire training.
  • Figure 4: Quantitative analysis of the $\lambda$ distribution during training for DPO, TR-DPO (hard update), and Pre-DPO under the Llama3.2-3B-Base setting. Numerical values on top of the bars indicate the corresponding percentages.
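The procedure described in the Figure 2 caption can be sketched as a small driver function. This is an illustrative outline only: `pre_dpo`, `first_pass`, and `dpo_pass` are hypothetical names, the two callables stand in for whatever training routine is actually used, and the string "models" in the usage example are toy placeholders rather than real checkpoints.

```python
from typing import Callable, Sequence, TypeVar

Model = TypeVar("Model")


def pre_dpo(initial_policy: Model,
            preference_data: Sequence,
            first_pass: Callable[[Model, Sequence], Model],
            dpo_pass: Callable[[Model, Model, Sequence], Model]) -> Model:
    """Two-stage Pre-DPO outline, following the Figure 2 caption.

    first_pass(policy, data) -> optimized policy (DPO or SimPO).
    dpo_pass(policy, reference, data) -> policy optimized with DPO
    against the given reference.
    """
    # Stage 1: optimize the initial policy once with DPO or SimPO.
    optimized_policy = first_pass(initial_policy, preference_data)

    # The optimized policy is reused as the guiding reference model.
    guiding_reference = optimized_policy

    # Stage 2: restart from the *initial* policy and re-optimize with DPO,
    # using the guiding reference in place of the usual initial-policy
    # reference. The same preference data is reused; no external data.
    return dpo_pass(initial_policy, guiding_reference, preference_data)


# Toy usage with string placeholders to show the data flow.
result = pre_dpo(
    "sft",
    [],
    first_pass=lambda policy, data: f"opt({policy})",
    dpo_pass=lambda policy, ref, data: f"dpo({policy}|ref={ref})",
)
print(result)  # dpo(sft|ref=opt(sft))
```

The point of the sketch is the data flow: the second stage starts again from the initial policy rather than continuing from the optimized one, and only the reference changes.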