Less is More: Improving LLM Alignment via Preference Data Selection

Xun Deng; Han Zhong; Rui Ai; Fuli Feng; Zheng Wang; Xiangnan He

TL;DR

This work tackles the data-quality bottleneck in Direct Preference Optimization (DPO) for aligning large language models. It introduces BeeS, a framework that combines margin-maximization data selection with Bayesian margin aggregation: the selection step mitigates the parameter shrinkage caused by noisy reward signals, and the aggregation step robustly combines multiple margin sources (external and implicit). The approach is highly data-efficient, achieving 3–8 percentage-point gains on AlpacaEval2 with only about 10% of the UltraFeedback data, and it extends to iterative DPO with online data, suggesting substantial practical savings in computation and labeling. Theoretical analysis links margin, noise, and parameter behavior, while experiments across multiple model families and datasets show consistent improvements that generalize to new architectures and learning algorithms. Overall, the work positions data curation as a pivotal lever for improving preference optimization in LLM alignment.

Abstract

Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. While prior work mainly extends DPO by modifying the objective function, we instead improve DPO through the largely overlooked but critical aspect of data selection. Specifically, we address the parameter shrinkage caused by noisy data by proposing a novel margin-maximization principle for dataset curation in DPO training. To further mitigate the noise in different reward models, we propose a Bayesian Aggregation approach that unifies multiple margin sources (external and implicit) into a single preference probability. Extensive experiments in diverse settings demonstrate the consistently high data efficiency of our approach. Remarkably, using just 10% of the UltraFeedback dataset, our approach achieves 3% to 8% improvements across various Llama, Mistral, and Qwen models on the AlpacaEval2 benchmark. Furthermore, our approach extends seamlessly to iterative DPO, yielding a roughly 3% improvement with 25% of the online data, which reveals high redundancy in this presumed high-quality way of constructing data. These results highlight the potential of data selection strategies for advancing preference optimization.
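
To make the selection idea in the abstract concrete, here is a minimal Python sketch of margin-based filtering with two margin sources. It is an illustration under stated assumptions, not the paper's actual BeeS formulation: the abstract does not give the aggregation formula, so the sketch assumes each margin acts as an independent log-odds estimate that the chosen response is truly preferred, fused under a uniform prior (which reduces to a sigmoid of the summed margins). The implicit margin uses the standard DPO implicit-reward expression. The function names and the dictionary keys "external_margin" and "implicit_margin" are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def implicit_margin(logp_w, ref_logp_w, logp_l, ref_logp_l, beta=0.1):
    """Standard DPO implicit reward margin between chosen (w) and rejected (l).

    Each argument is a sequence log-probability under the policy or the
    reference model; beta is the usual DPO temperature.
    """
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def aggregate_preference(margins):
    """Fuse several margins into one preference probability.

    ASSUMPTION (not taken from the paper): every margin is an independent
    log-odds estimate and the prior is uniform, so the posterior log-odds
    is simply the sum of the margins.
    """
    return sigmoid(sum(margins))


def select_top_fraction(pairs, fraction=0.1):
    """Keep the given fraction of pairs with the highest aggregated probability."""
    if not pairs:
        return []
    k = max(1, math.ceil(fraction * len(pairs)))
    ranked = sorted(
        pairs,
        key=lambda p: aggregate_preference(
            [p["external_margin"], p["implicit_margin"]]
        ),
        reverse=True,
    )
    return ranked[:k]
```

Under this assumed fusion rule, a pair on which the two sources disagree strongly (for example an external margin of 3.0 and an implicit margin of -2.5) ranks below a pair with modest but consistent margins, which is the kind of noise robustness the aggregation step is meant to provide. In practice the margins would likely need to be put on a comparable scale before summing.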
