Adaptive Batch-Wise Sample Scheduling for Direct Preference Optimization

Zixuan Huang, Yikun Ban, Lean Fu, Xiaojie Li, Zhongxiang Dai, Jianxin Li, Deqing Wang

TL;DR

This work addresses the data-quality bottleneck in Direct Preference Optimization (DPO) by introducing SamS, a lightweight, batch-wise sample scheduler that dynamically selects training samples based on the evolving internal states of the language model. The scheduling problem is cast as a contextual bandit with a lagged training update and an auxiliary exploration network. SamS computes a composite reward from batch-level learning progress, per-sample uncertainty, and per-sample preference margins, and uses it to guide adaptive subset selection without modifying the core DPO objective. Empirically, integrating SamS into DPO yields consistent improvements on AlpacaEval 2 and MT-Bench, demonstrates robustness to label noise, and reduces memory overhead compared with data pre-selection baselines. The authors suggest the approach may generalize to RLHF and other supervised learning settings, enabling more efficient and stable alignment with human preferences.

Abstract

Direct Preference Optimization (DPO) has emerged as an effective approach for aligning large language models (LLMs) with human preferences. However, its performance is highly dependent on the quality of the underlying human preference data. To address this bottleneck, prior work has explored various data selection strategies, but these methods often overlook the impact of the evolving states of the language model during the optimization process. In this paper, we introduce a novel problem: Sample Scheduling for DPO, which aims to dynamically and adaptively schedule training samples based on the model's evolving batch-wise states throughout preference optimization. To solve this problem, we propose SamS, an efficient and effective algorithm that adaptively selects samples in each training batch based on the LLM's learning feedback to maximize the potential generalization performance. Notably, without modifying the core DPO algorithm, simply integrating SamS significantly improves performance across tasks, with minimal additional computational overhead. This work points to a promising new direction for improving LLM alignment through batch-wise sample selection, with potential generalization to RLHF and broader supervised learning paradigms.
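For readers unfamiliar with the objective that SamS leaves unchanged, the standard per-sample DPO loss and the implicit preference margin can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names, the default `beta=0.1`, and the assumption that sequence-level log-probabilities are already computed are all hypothetical choices made for the example.

```python
import math


def dpo_margin(policy_chosen_logp, policy_rejected_logp,
               ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Implicit reward margin between the chosen and rejected responses.

    Inputs are sequence-level log-probabilities under the policy and the
    frozen reference model (assumed to be precomputed elsewhere).
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return beta * (chosen_logratio - rejected_logratio)


def dpo_loss(margin):
    """Per-sample DPO loss, -log(sigmoid(margin)), in a numerically stable form."""
    # -log(sigmoid(m)) = softplus(-m) = max(-m, 0) + log1p(exp(-|m|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

A larger positive margin means the policy already prefers the chosen response more strongly than the reference does, so the loss is smaller; a margin of zero gives a loss of log 2.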

Paper Structure

This paper contains 40 sections, 14 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: A study of the challenges in SamS. (a) Learning difficulty varies across model states. For the same 16 samples, we track the DPO loss across different states of the language model, from training step 0 to step 30,000. We use the relative DPO loss of each sample as the difficulty measure [gao2025principled]. (b) Noisy data degrades DPO performance. During preference optimization of Pythia-2.8B [biderman2023pythia] on the Anthropic-HH dataset [bai2022training], we artificially injected 20% noise into the preference labels. As a result, the performance of DPO dropped significantly, highlighting its sensitivity to data quality.
  • Figure 2: (Left) Overview of a standard DPO framework integrated with SamS. (Right) The architecture of the Scheduler. The Scheduler first treats the policy's hidden state sequence as the arm context for each sample. The Encoder aggregates the state information of each sample to encode the arm context. The Exploitation-Exploration Network then uses the encoded arm contexts to estimate a reward value for each sample, and these values are used to select a Top-K subset for policy learning.
  • Figure 3: Robustness Testing of SamS: DPO vs. DPO+SamS (Test Accuracy).
  • Figure 4: Computational cost of DPO vs. DPO+SamS: similar runtime and 18% less GPU memory usage.
  • Figure 5: A comparison of different scheduler selection ratios in SamS reveals that 75% outperforms 50%, which in turn surpasses 100%, followed by 25%.
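
The Top-K selection step described in Figure 2, at the selection ratios compared in Figure 5, can be illustrated with a short sketch. It assumes each sample in the batch already has a scalar estimated reward; in SamS these come from the Scheduler's network, which is not reproduced here. The function name `select_top_k` and the choice to round the subset size up are assumptions made for the example, not details taken from the paper.

```python
import math


def select_top_k(scores, ratio=0.75):
    """Return indices of the highest-scoring samples in a batch.

    scores: one estimated reward per sample (placeholder values here).
    ratio:  fraction of the batch to keep; rounding up is an assumption.
    The returned indices are sorted so the subset keeps batch order.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    k = max(1, math.ceil(ratio * len(scores)))
    # Rank sample indices by score, highest first, then keep the first k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

With a batch of four samples and a ratio of 0.75, three samples are kept and the lowest-scoring one is skipped for that update; a ratio of 1.0 keeps the whole batch, which corresponds to plain DPO.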