Preference Optimization with Multi-Sample Comparisons

Chaoqi Wang, Zhuokai Zhao, Chen Zhu, Karthik Abinav Sankararaman, Michal Valko, Xuefei Cao, Zhaorun Chen, Madian Khabsa, Yuxin Chen, Hao Ma, Sinong Wang

TL;DR

This work tackles alignment of generative models by moving beyond single-sample preferences to multi-sample comparisons. It introduces mDPO and mIPO, extensions of DPO and IPO that optimize group-wise and distributional characteristics by comparing groups of samples rather than individual outputs. Across random number generation (RNG) tasks, diffusion-image debiasing, and fiction generation, the multi-sample methods improve diversity, reduce bias, and raise output quality, and they are more robust to label noise. The paper also provides a low-variance unbiased estimator for mIPO, underscoring the value of distributional preference optimization for robust alignment of large-scale models.

Abstract

Recent advancements in generative models, particularly large language models (LLMs) and diffusion models, have been driven by extensive pretraining on large datasets followed by post-training. However, current post-training methods such as reinforcement learning from human feedback (RLHF) and direct alignment from preference methods (DAP) primarily utilize single-sample comparisons. These approaches often fail to capture critical characteristics such as generative diversity and bias, which are more accurately assessed through multiple samples. To address these limitations, we introduce a novel approach that extends post-training to include multi-sample comparisons. To achieve this, we propose Multi-sample Direct Preference Optimization (mDPO) and Multi-sample Identity Preference Optimization (mIPO). These methods improve traditional DAP methods by focusing on group-wise characteristics. Empirically, we demonstrate that multi-sample comparison is more effective in optimizing collective characteristics (e.g., diversity and bias) for generative models than single-sample comparison. Additionally, our findings suggest that multi-sample comparisons provide a more robust optimization framework, particularly for datasets with label noise.
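
To make the idea of a group-wise comparison concrete, the following is a minimal sketch of a DPO-style loss computed on groups of samples rather than on a single pair. It assumes the group's implicit reward is the mean log-ratio between the policy and a reference model over the group; that averaging, the function name `group_dpo_loss`, and the default `beta` are illustrative assumptions and may differ from the paper's exact mDPO objective.

```python
import math


def group_dpo_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, beta=0.1):
    """Logistic (DPO-style) loss on group-averaged log-ratio margins.

    logp_w / ref_logp_w: policy and reference log-probabilities for each
    sample in the preferred group; logp_l / ref_logp_l: same for the
    dispreferred group. Averaging over the group is an assumption made
    for illustration.
    """
    def mean_log_ratio(logp, ref_logp):
        return sum(a - b for a, b in zip(logp, ref_logp)) / len(logp)

    margin = beta * (
        mean_log_ratio(logp_w, ref_logp_w) - mean_log_ratio(logp_l, ref_logp_l)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With one sample per group this reduces to the usual single-pair logistic loss, so the group size is the only new knob in the sketch.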

Paper Structure

This paper contains 29 sections, 4 theorems, 41 equations, 16 figures, 6 tables.

Key Result

Proposition 1

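Let $f: \mathcal{X} \to \mathbb{R}$ be a measurable function, and let $p$ and $q$ be probability distributions on $\mathcal{X}$. Define $\ell = (\mathbb{E}_{x\sim p}[f(x)] - \mathbb{E}_{x\sim q}[f(x)] - c)^2$, where $c$ is a constant. Let $x^p_1, \ldots, x^p_n$ be i.i.d. samples from $p$, and $x^q_1, \ldots, x^q_m$ be i.i.d. samples from $q$. Then $\hat{\ell} = \left(\frac{1}{n}\sum_{i=1}^{n} f(x^p_i) - \frac{1}{m}\sum_{j=1}^{m} f(x^q_j) - c\right)^2 - \left(\frac{\hat{\sigma}_p^2}{n} + \frac{\hat{\sigma}_q^2}{m}\right)$ is an unbiased estimator of $\ell$, where $\hat{\sigma}_p^2$ and $\hat{\sigma}_q^2$ are the sample variances of $f(x)$ under $p$ and $q$, respectively.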
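
The bias correction behind Proposition 1 can be checked numerically. The sketch below assumes the standard argument: the squared difference of sample means overestimates $\ell$ in expectation by $\mathrm{Var}_p[f]/n + \mathrm{Var}_q[f]/m$, so subtracting the unbiased sample variances divided by the sample sizes removes the bias. The Gaussian choices for $f(x)$, the sample sizes, and all function names are hypothetical and chosen only for illustration.

```python
import random
import statistics


def plug_in_estimate(fp, fq, c):
    """Biased estimate: squared difference of sample means, minus c."""
    return (statistics.fmean(fp) - statistics.fmean(fq) - c) ** 2


def corrected_estimate(fp, fq, c):
    """Plug-in estimate minus the variance of the mean difference.

    statistics.variance uses the (n - 1) denominator, i.e. the unbiased
    sample variance, so each group needs at least two samples.
    """
    correction = (
        statistics.variance(fp) / len(fp) + statistics.variance(fq) / len(fq)
    )
    return plug_in_estimate(fp, fq, c) - correction


def simulate(trials=20000, n=5, m=5, c=0.5, seed=0):
    """Average both estimators over many draws of f-values.

    Assumed setup: f(x) ~ N(1, 1) under p and N(0, 1) under q, so the
    true loss is (1 - 0 - c)^2.
    """
    rng = random.Random(seed)
    biased, unbiased = [], []
    for _ in range(trials):
        fp = [rng.gauss(1.0, 1.0) for _ in range(n)]
        fq = [rng.gauss(0.0, 1.0) for _ in range(m)]
        biased.append(plug_in_estimate(fp, fq, c))
        unbiased.append(corrected_estimate(fp, fq, c))
    return statistics.fmean(biased), statistics.fmean(unbiased)
```

Under these assumed settings the true value is $\ell = 0.25$; the averaged plug-in estimate should land near $0.25 + 1/5 + 1/5 = 0.65$, while the averaged corrected estimate should land near $0.25$.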

Figures (16)

  • Figure 1: Top: Diversity of responses from two groups for improving urban transportation. The left group provides a broader range of approaches, including public transit and infrastructure improvements. The right group focuses more narrowly on specific technological and management solutions. Bottom: Bias in images from two groups. The left group displays a more balanced representation of race and gender, while the right group predominantly features males due to stereotypes.
  • Figure 2: Biased estimator vs. Unbiased estimator.
  • Figure 3: Distance comparison.
  • Figure 4: Diversity in fiction generation using the same model (Llama 3-8B) finetuned with different approaches, assessed through genre distribution. Left: mDPO and DPO. Right: mIPO and IPO. KL divergence between each genre distribution and the uniform distribution (smaller is better; the best value in each panel is in bold): DPO: $0.170$; mDPO ($k=3$): $\mathbf{0.126}$; mDPO ($k=5$): $0.142$; IPO: $0.125$; mIPO ($k=3$): $0.094$; mIPO ($k=5$): $\mathbf{0.050}$.
  • Figure 5: mDPO and mIPO versus DPO and IPO on Alpaca Evals using GPT-4o evaluation.
  • ...and 11 more figures
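
The diversity metric reported for Figure 4, the KL divergence between a genre distribution and the uniform distribution, can be computed as in the sketch below. It assumes genre frequencies are given as raw counts and uses the natural logarithm; the log base, the direction $\mathrm{KL}(\text{genres} \,\|\, \text{uniform})$, and the function name `kl_to_uniform` are assumptions, since the caption does not state them.

```python
import math


def kl_to_uniform(counts):
    """KL(p || uniform) for the empirical distribution p given by counts.

    Zero counts contribute nothing (0 * log 0 is taken as 0). The result
    is 0 for a perfectly uniform distribution and log(K) when all mass
    sits on one of K categories.
    """
    total = sum(counts)
    k = len(counts)
    kl = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            kl += p * math.log(p * k)  # p * log(p / (1 / k))
    return kl
```

For example, `kl_to_uniform([5, 5, 5, 5])` is zero, and the value grows as generations concentrate on fewer genres.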

Theorems & Definitions (9)

  • Proposition 1
  • Proposition 2
  • Remark 1
  • Proposition 2
  • proof
  • Proposition 2
  • proof
  • Remark 1
  • proof