Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples

Chengqian Gao, Haonan Li, Liu Liu, Zeke Xie, Peilin Zhao, Zhiqiang Xu

TL;DR

The paper argues that alignment data should be matched to model capacity, showing that overly difficult preference examples can harm alignment. It introduces a principled data-difficulty criterion and the Selective DPO method, which filters training data by estimated difficulty using validation loss proxies and multiple reference models. Across benchmarks, Selective DPO yields 9–16% improvements in win rates over standard DPO, with robust gains when training data are aligned to the model's capacity. The work suggests a shift from data quantity toward difficulty-aware data selection to improve LLM alignment and informs future RLHF-oriented strategies.

Abstract

The alignment of large language models (LLMs) often assumes that using more clean data yields better outcomes, overlooking the match between model capacity and example difficulty. Challenging this, we propose a new principle: preference data vary in difficulty, and overly difficult examples hinder alignment by exceeding the model's capacity. Through systematic experimentation, we validate this principle with three key findings: (1) preference examples vary in difficulty, as evidenced by consistent learning orders across alignment runs; (2) overly difficult examples significantly degrade performance across four LLMs and two datasets; and (3) the capacity of a model dictates its threshold for handling difficult examples, underscoring a critical relationship between data selection and model capacity. Building on this principle, we introduce Selective DPO, which filters out overly difficult examples. This simple adjustment improves alignment performance by 9–16% in win rates on the AlpacaEval 2 benchmark compared to the DPO baseline, surpassing a series of DPO variants with different algorithmic adjustments. Together, these results illuminate the importance of aligning data difficulty with model capacity, offering a transformative perspective for improving alignment strategies in LLMs. Code is available at https://github.com/glorgao/SelectiveDPO.
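
The filtering idea described above can be illustrated with a minimal sketch: rank preference examples by a validation-loss proxy averaged over several reference models, then keep only the easiest fraction for training. The function name `select_easy_examples`, the `(models, examples)` array layout, and the `keep_fraction` parameter are assumptions introduced here for illustration; they are not taken from the authors' released code, and the numbers in the usage note are synthetic.

```python
import numpy as np


def select_easy_examples(val_loss, keep_fraction=0.5):
    """Return indices of the easiest examples, ordered from easy to difficult.

    val_loss: array-like of shape (n_reference_models, n_examples) with one
        validation loss per example and reference model (assumed layout).
    keep_fraction: share of the dataset to keep (hypothetical parameter).
    """
    val_loss = np.asarray(val_loss, dtype=float)
    if val_loss.ndim != 2:
        raise ValueError("val_loss must be 2-D: (models, examples)")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Difficulty proxy: mean validation loss across reference models.
    difficulty = val_loss.mean(axis=0)

    # Stable sort so that ties keep their original dataset order.
    order = np.argsort(difficulty, kind="stable")

    # Keep at least one example, truncating toward zero otherwise.
    n_keep = max(1, int(keep_fraction * order.size))
    return order[:n_keep]
```

For instance, with two reference models and four examples, `select_easy_examples([[0.9, 0.1, 0.5, 0.3], [0.7, 0.3, 0.5, 0.5]], keep_fraction=0.5)` keeps the two examples with the lowest mean loss. The retained indices would then be used to subset the preference dataset before running standard DPO training.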

Paper Structure

This paper contains 50 sections, 4 equations, 14 figures, 11 tables, 1 algorithm.

Figures (14)

  • Figure 1: Overly difficult examples hinder alignment. Training on difficult examples, identified by high validation loss, adversely affects alignment and decreases overall performance by 9.4% in win rate. The results are from experiments with four SFT models on the UltraFeedback-binarized dataset, i.e., Figure \ref{fig:base-model-struggles}.
  • Figure 2: Examples are learned in consistent orders across different runs of the same LLM, despite variations in the training data and random seeds. Left: The learned step (ranging from 1 to 948) represents the step at which the implicit reward model distinguishes between preferred and rejected responses (see Eq. (\ref{eq:learned-step}), threshold $\delta=0.4$). X-axis: 40 unique combinations of model size (4 total) and training data subset (10 per model). Y-axis: 300 test examples, sorted by average learned step across 40 runs. The color gradient encodes difficulty. Middle: Two Spearman's rank correlation matrices. Lower triangle: correlations of learned step across runs; upper triangle: validation loss correlations. Right: Two Jaccard similarity matrices for difficult examples (top 50%) defined by learned step and validation loss across runs.
  • Figure 3: Direct Preference Optimization (DPO) struggles with difficult examples, broadly and significantly. We present the defined WR$'$ evolution for four models trained on the Argilla-dpo-mix-7k and UltraFeedback-binarized datasets. The results are based on checkpoints from three 1-epoch runs with different seeds. Random Ordering (DPO): Training data are presented in a randomized sequence. Sorted by VL (From Easy to Difficult): Training examples are ranked by their validation loss (VL) and presented from easy to difficult, following a curriculum learning approach. Selected by VL (Shuffled): The easiest 60% (for Argilla-7K) or 50% (for UF-binarized) of the data is selected based on VL, and examples are sampled in a random order for training. The VL measurements are displayed as bar plots. We include evaluation results (dashed lines) from the two corresponding DPO models released by Meng et al. (2024) for reference.
  • Figure 4: Difficult examples are not necessarily data errors. (a): flipping the last 40% of examples, i.e., those with higher validation loss. (b): sorting the examples with the $\epsilon$-greedy sorting algorithm. In this case, each mini-batch contains a $(1-\epsilon)$ fraction of easy-to-difficult examples and an $\epsilon$ fraction of randomly sampled examples. (c): increasing and decreasing the learning rate. All experiments are conducted on the Mistral-7B-SFT model with the Argilla-dpo-mix-7k dataset.
  • Figure 5: Difficult examples benefit larger models with greater capacities. Examples are sorted by their validation loss, ranging from easy to difficult. We fit the measured WR$'$ (scatter points) using a second-degree polynomial (dashed line), identifying the peak of each parabola as the sweet spot (marker). Notably, larger models reach sweet spots at higher data percentages, indicating that models with greater capacity can manage more challenging examples. The results are from ten runs per model type, evaluated using ArmoRM (Wang et al., 2024).
  • ...and 9 more figures
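
The sweet-spot procedure described in the Figure 5 caption (fit a second-degree polynomial to win rate versus data percentage, then take the peak of the parabola) can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the function name `sweet_spot` is hypothetical, and the data points in the usage note are synthetic rather than results from the paper.

```python
import numpy as np


def sweet_spot(data_percent, win_rate):
    """Fit win_rate ~ a*x**2 + b*x + c and return the x at the parabola's peak.

    data_percent: percentages of (easy-to-difficult sorted) data used.
    win_rate: measured win rate at each percentage.
    """
    x = np.asarray(data_percent, dtype=float)
    y = np.asarray(win_rate, dtype=float)

    # np.polyfit returns coefficients from the highest degree down.
    a, b, _c = np.polyfit(x, y, deg=2)

    # A maximum exists only if the parabola opens downward.
    if a >= 0:
        raise ValueError("fitted parabola has no maximum (a >= 0)")

    # Vertex of a*x**2 + b*x + c.
    return -b / (2.0 * a)
```

As a usage example, synthetic points generated from a parabola peaking at 60% of the data, such as `x = np.arange(10, 101, 10)` and `y = 50 - (x - 60) ** 2 / 100`, should yield a sweet spot close to 60. With noisy measurements, the estimate depends on how many percentages are sampled and on run-to-run variance.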

Theorems & Definitions (3)

  • Definition 3.1: Difficult example
  • Remark 3.2
  • Remark 5.1: Flexible hyper-parameter $\tau$