Preference Optimization with Multi-Sample Comparisons
Chaoqi Wang, Zhuokai Zhao, Chen Zhu, Karthik Abinav Sankararaman, Michal Valko, Xuefei Cao, Zhaorun Chen, Madian Khabsa, Yuxin Chen, Hao Ma, Sinong Wang
TL;DR
This work tackles alignment of generative models by moving beyond single-sample preferences to comparisons over groups of samples. It introduces mDPO and mIPO, extensions of DPO and IPO that optimize group-wise and distributional characteristics by comparing sample groups rather than individual outputs. Across random number generation (RNG) tasks, debiasing of diffusion-model image generation, and fiction generation, the multi-sample methods yield improved diversity, reduced bias, and better output quality, and they are more robust to label noise in the preference data. The paper also provides a low-variance unbiased estimator for mIPO and demonstrates its practical effectiveness, underscoring the value of distributional preference optimization for robust alignment of large-scale models.
Abstract
Recent advancements in generative models, particularly large language models (LLMs) and diffusion models, have been driven by extensive pretraining on large datasets followed by post-training. However, current post-training methods such as reinforcement learning from human feedback (RLHF) and direct alignment from preference methods (DAP) primarily utilize single-sample comparisons. These approaches often fail to capture critical characteristics such as generative diversity and bias, which are more accurately assessed through multiple samples. To address these limitations, we introduce a novel approach that extends post-training to include multi-sample comparisons. To achieve this, we propose Multi-sample Direct Preference Optimization (mDPO) and Multi-sample Identity Preference Optimization (mIPO). These methods improve on traditional DAP methods by focusing on group-wise characteristics. Empirically, we demonstrate that multi-sample comparison is more effective than single-sample comparison in optimizing collective characteristics (e.g., diversity and bias) of generative models. Additionally, our findings suggest that multi-sample comparisons provide a more robust optimization framework, particularly for datasets with label noise.
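To make the idea of comparing sample groups concrete, the following is a minimal sketch of a DPO-style loss computed on two groups of samples instead of a single preferred/dispreferred pair. It rests on an assumption not stated in the abstract: that each group is scored by the mean of its per-sample log-ratios between the policy and a reference model. The function name `group_dpo_loss`, its arguments, and the default `beta` are hypothetical, and this should not be read as the paper's exact mDPO objective.

```python
import math


def group_dpo_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, beta=0.1):
    """Illustrative DPO-style loss over a preferred and a dispreferred group.

    Each argument is a list of per-sample log-probabilities under the policy
    (logp_*) or the reference model (ref_logp_*). Assumption: a group is
    scored by the mean of its per-sample log-ratios.
    """
    def mean_log_ratio(logp, ref_logp):
        # Average of log pi(y) - log pi_ref(y) over the samples in one group.
        return sum(p - r for p, r in zip(logp, ref_logp)) / len(logp)

    margin = beta * (
        mean_log_ratio(logp_w, ref_logp_w) - mean_log_ratio(logp_l, ref_logp_l)
    )
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), written in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

With one sample per group, this sketch reduces to the familiar single-sample DPO loss; with larger groups, the margin depends on a group-level average rather than on any individual output.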
