Is On-Policy Data always the Best Choice for Direct Preference Optimization-based LM Alignment?

Zetian Sun, Dongfang Li, Xuhui Chen, Baotian Hu, Min Zhang

TL;DR

This paper challenges the default assumption that on-policy data is always superior for aligning LMs with human preferences. It introduces an alignment stage hypothesis that splits alignment into two stages, preference injection (which benefits from diverse data) and preference fine-tuning (which favors high-quality data), and develops a boundary-measurement algorithm to detect the transition between them during training. The authors provide theoretical analysis linking the DPO objective to the alignment objective and define a practical, Bradley-Terry (BT)-based notion of preference consistency to estimate the general text distribution. Empirically, they show that the relative effectiveness of on-policy and off-policy data varies across models and initial conditions, and that the boundary-measurement approach generalizes across multiple LMs and to an alternative alignment method (SLiC-HF). The work offers guidance for data selection in LM alignment: stage-aware data blending can outperform a naïve on-policy-only strategy.

Abstract

The alignment of language models (LMs) with human preferences is critical for building reliable AI systems. The problem is typically framed as optimizing an LM policy to maximize an expected reward that reflects human preferences. Recently, Direct Preference Optimization (DPO) was proposed as an LM alignment method that directly optimizes the policy from static preference data, and it was further improved by incorporating on-policy sampling (i.e., preference candidates generated during the training loop). However, we show that on-policy data is not always optimal: a systematic difference in effectiveness emerges between static and on-policy preference candidates. For example, compared with static data, on-policy data is $3\times$ as effective for Llama-3 but only $0.4\times$ as effective for Zephyr. To explain this phenomenon, we propose the alignment stage assumption, which divides the alignment process into two distinct stages: the preference injection stage, which benefits from diverse data, and the preference fine-tuning stage, which favors high-quality data. Through theoretical and empirical analysis, we characterize these stages and propose an effective algorithm to identify the boundary between them. We perform experiments on $5$ models (Llama, Zephyr, Phi-2, Qwen, Pythia) and $2$ alignment methods (DPO, SLiC-HF) to show the generalizability of the alignment stage assumption and the effectiveness of the boundary measurement algorithm.

Paper Structure

This paper contains 54 sections, 3 theorems, 22 equations, 4 figures, 10 tables, 1 algorithm.

Key Result

Theorem 5.1

(Bijection between reward function and policy) Under mild assumptions, for any policy $\pi_\theta$ and static reference model $\pi_{\rm ref}$, there exists a unique reward model $r_\phi$ such that $\pi_\theta$ is the optimal solution of the alignment objective, Eq. (eqn:align_obj).
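
For orientation, the bijection can be written out under the assumption that Eq. (eqn:align_obj) takes the standard KL-regularized form used in the DPO literature. This summary does not reproduce that equation, so the symbols $\beta$ (regularization strength), $\mathcal{D}$ (prompt distribution), and $Z(x)$ (partition function) are assumptions introduced here for illustration only.

```latex
\begin{align}
% Assumed form of the alignment objective (standard KL-regularized reward maximization)
\max_{\pi_\theta}\;
  \mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x,y)\big]
  - \beta\,\mathbb{D}_{\rm KL}\big[\pi_\theta(\cdot\mid x)\,\|\,\pi_{\rm ref}(\cdot\mid x)\big] \\
% Its closed-form optimum: the policy determined by a given reward
\pi_\theta(y\mid x)
  = \frac{1}{Z(x)}\,\pi_{\rm ref}(y\mid x)\,
    \exp\!\Big(\tfrac{1}{\beta}\,r_\phi(x,y)\Big),
\qquad
Z(x) = \sum_{y}\pi_{\rm ref}(y\mid x)\,\exp\!\Big(\tfrac{1}{\beta}\,r_\phi(x,y)\Big) \\
% Inverting the optimum: the reward determined by a given policy
r_\phi(x,y)
  = \beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\rm ref}(y\mid x)} + \beta\log Z(x)
\end{align}
```

In this standard form the reward is recovered from the policy only up to the prompt-dependent term $\beta\log Z(x)$, so the "mild assumptions" of the theorem presumably fix that term; the exact condition is not stated in this summary.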

Figures (4)

  • Figure 1: Illustration of our alignment stage assumption and the different characteristics of (a) the preference injection stage and (b) the preference fine-tuning stage. The alignment area indicates the preferred region of preference candidates at the corresponding alignment stage. The stage boundary is estimated by the distance between the ground-truth text distribution ($\pi_G$) and the simulated text distributions ($\pi_{\theta_1},\pi_{\theta_2}$).
  • Figure 2: Illustration of the alignment stage assumption. The alignment process is a continuous transition from the preference injection stage to the preference fine-tuning stage. We demonstrate the characteristics of the stages (Case 1 and Case 2). We establish the relationship among the preference distribution, the reward model, and the text distribution, which helps us understand the alignment process from the perspective of distribution distance and preference consistency. Practically, we propose the boundary measurement, which decides which stage the policy is currently in by judging which distribution ($\pi_{\rm off}$ or $\pi_\theta$) is a better estimate of the ground-truth distribution ($\pi_G$).
  • Figure 3: The intra-diversity between $\rm PC_{off}$ and $\rm PC_{llama}$, defined by the difference ($\Delta$) of log probabilities between the chosen and the rejected answer across different models, including Zephyr-7B, Qwen2.5-1.5B, and Qwen3-4B. The curves of $\rm PC_{on}$ are also included as a reference.
  • Figure 4: Illustration of the relationship between the general text distribution estimation and our boundary measurement algorithm, discussed in §\ref{sec:general_estimation} and §\ref{sec:practical}. The boundary measurement algorithm is derived from the preference consistency measurement, which is designed to estimate the consistency between two preference distributions that are defined as proxies for the intractable text distribution.
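
The quantities named in the captions of Figures 3 and 4 can be illustrated with a minimal Python sketch. It computes the log-probability difference $\Delta$ between a chosen and a rejected answer, maps it to a Bradley-Terry preference probability, and measures how often two sets of margins agree on the preferred answer. The function names, the sign-agreement rule, and the toy numbers are all assumptions made for illustration; this is not the paper's boundary measurement algorithm, whose exact definition is not given in this summary.

```python
import math
from typing import Sequence


def sequence_log_prob(token_log_probs: Sequence[float]) -> float:
    """Log probability of a full answer as the sum of its per-token log probabilities."""
    return sum(token_log_probs)


def log_prob_margin(logp_chosen: float, logp_rejected: float) -> float:
    """Delta: log p(chosen) minus log p(rejected) for one preference pair."""
    return logp_chosen - logp_rejected


def bt_preference_prob(margin: float) -> float:
    """Bradley-Terry probability that the chosen answer is preferred (sigmoid of the margin)."""
    return 1.0 / (1.0 + math.exp(-margin))


def preference_agreement(margins_a: Sequence[float], margins_b: Sequence[float]) -> float:
    """Hypothetical consistency score (an assumption, not the paper's definition):
    the fraction of pairs on which two sets of margins prefer the same answer."""
    if len(margins_a) != len(margins_b) or not margins_a:
        raise ValueError("margin lists must be non-empty and of equal length")
    same = sum(1 for a, b in zip(margins_a, margins_b) if (a > 0) == (b > 0))
    return same / len(margins_a)


# Toy per-token log probabilities for two preference pairs (made-up numbers).
chosen = [[-0.5, -0.5], [-1.5, -1.5]]
rejected = [[-1.0, -1.0], [-1.0, -1.5]]

policy_margins = [
    log_prob_margin(sequence_log_prob(c), sequence_log_prob(r))
    for c, r in zip(chosen, rejected)
]
# Margins from a second, hypothetical model scored on the same pairs.
other_margins = [0.3, 0.2]

agreement = preference_agreement(policy_margins, other_margins)
```

A margin of zero corresponds to a Bradley-Terry probability of 0.5, meaning the model has no preference between the two answers. Sign agreement is only one possible way to compare two preference distributions; a probability-weighted comparison would be an equally reasonable choice for a sketch like this.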

Theorems & Definitions (8)

  • Theorem 5.1
  • Theorem 5.2
  • Definition 5.3
  • Theorem 5.4
  • Definition 5.5
  • proof
  • proof
  • proof