Less is More: Improving LLM Alignment via Preference Data Selection

Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, Xiangnan He

TL;DR

This work tackles the data-quality bottleneck in Direct Preference Optimization (DPO) for aligning large language models. It introduces BeeS, a framework that combines margin-maximization data selection with Bayesian margin aggregation, to mitigate the parameter shrinkage caused by noisy reward signals and to aggregate multiple margin sources (external and implicit) robustly. The approach is highly data-efficient: it achieves 3–8 percentage-point gains on AlpacaEval2 with only about 10% of the UltraFeedback data, and it extends to iterative DPO with online data, suggesting substantial practical savings in computation and labeling. Theoretical analysis links margin, noise, and parameter behavior, while experiments across multiple model families and datasets show consistent improvements and generalization to new architectures and learning algorithms. Overall, the work presents data curation as a pivotal lever for improving preference optimization in LLM alignment.

Abstract

Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. While prior work mainly extends DPO by modifying the objective function, we instead improve DPO through the largely overlooked but critical aspect of data selection. Specifically, we address the issue of parameter shrinkage caused by noisy data by proposing a novel margin-maximization principle for dataset curation in DPO training. To further mitigate the noise in different reward models, we propose a Bayesian Aggregation approach that unifies multiple margin sources (external and implicit) into a single preference probability. Extensive experiments in diverse settings demonstrate the consistently high data efficiency of our approach. Remarkably, using just 10% of the UltraFeedback dataset, our approach achieves 3% to 8% improvements across various Llama, Mistral, and Qwen models on the AlpacaEval2 benchmark. Furthermore, our approach extends seamlessly to iterative DPO, yielding a roughly 3% improvement with 25% of the online data, which reveals high redundancy in data constructed this way despite its presumed high quality. These results highlight the potential of data selection strategies for advancing preference optimization.
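
To make the selection idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each preference pair carries a list of reward margins (chosen minus rejected), that each margin maps to a preference probability through a sigmoid (a Bradley-Terry-style assumption), and that the sources are combined as if they were independent with a uniform prior. The function names, the `margins` field, and the combination rule are all hypothetical; the paper's actual Bayesian Aggregation may differ. The implicit margin follows the standard DPO implicit-reward form.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def implicit_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO implicit reward margin between the chosen (w) and
    # rejected (l) responses: beta * (log-ratio of w - log-ratio of l).
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def aggregate_preference(margins):
    # Assumption: each source gives p_i = sigmoid(m_i), and sources are
    # treated as independent with a uniform prior. Then
    # prod(p_i) / (prod(p_i) + prod(1 - p_i)) simplifies to sigmoid(sum(m_i)).
    return sigmoid(sum(margins))


def select_top_fraction(samples, fraction=0.1):
    # Keep the pairs whose aggregated preference probability is highest,
    # i.e. the pairs with the largest combined margin.
    ranked = sorted(
        samples,
        key=lambda s: aggregate_preference(s["margins"]),
        reverse=True,
    )
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]


if __name__ == "__main__":
    # Toy data: each pair has an external margin and an implicit margin.
    data = [{"id": i, "margins": [0.3 * i - 1.0, 0.1 * i]} for i in range(10)]
    subset = select_top_fraction(data, fraction=0.2)
    print([s["id"] for s in subset])
```

In practice the margin sources can sit on very different scales, so a real pipeline would likely need to calibrate or normalize them before combining; the sketch skips that step.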

Paper Structure

This paper contains 30 sections, 8 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: The workflow of the BeeS method.
  • Figure 2: Visualization of joint margin distribution on UltraFeedback. (Left) Joint distribution of external and implicit reward margin values. (Middle) Joint distribution of implicit reward margins computed using models of 1B and 3B scales. (Right) Joint distribution of two different external reward margin values on online-generated data.
  • Figure 3: DPO training loss and margin of Llama-3.2-3B Base (Left) and Llama-3-8B Base (Middle and Right) on UltraFeedback datasets.
  • Figure 4: AlpacaEval 2.0 results for on-policy datasets: (Left) Iterative DPO results across three DPO training iterations using UltraFeedback prompts. (Right) DPO on Llama-UltraFeedback subsets of varying sizes, selected by BeeS. Results of DPO variants trained on the full set are also shown for comparison.
  • Figure 5: Ablation Study: (Left) different model choices (Mistral-7B-Instruct-v-0.2, Qwen-2.5-Instruct-7B and Qwen-2.5-Instruct-14B). BeeS selects a 6k-sample subset for training. (Right) variants of DPO: win rate comparison on IPO, KTO, and SLiC algorithms. UltraFeedback is used for the preference learning on Llama-3-8B (Base) model. Rand and BeeS select a 6k-sample subset.
  • ...and 8 more figures