How Well Can Preference Optimization Generalize Under Noisy Feedback?

Shawn Im, Sharon Li

TL;DR

This work analyzes how noisy human feedback affects generalization in preference optimization for aligning large language models. It develops a finite-step, boundary-dynamics framework around Generalized Preference Optimization (GPO), incorporating $\epsilon$-mislabeled and $\omega$-uncertain noise models and a von Mises–Fisher data assumption to characterize robustness. A probabilistic generalization guarantee shows population risk remains exponentially small when data are well-separated and sample size is sufficient, with explicit dependence on noise rate $\epsilon$, concentration $\gamma$, and separation $\phi$, and an inflection near $\epsilon=\tfrac{1}{2}$. The theory is validated on controlled simulations and real Anthropic data, offering practical diagnostics and guidelines for robust preference alignment under noisy feedback.

Abstract

As large language models (LLMs) advance their capabilities, aligning these models with human preferences has become crucial. Preference optimization, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for aligning LLMs. However, most existing works assume noise-free feedback, which is unrealistic due to the inherent errors and inconsistencies in human judgments. This paper addresses the impact of noisy feedback on preference optimization, providing generalization guarantees under these conditions. In particular, we consider noise models that correspond to common real-world sources of noise, such as mislabeling and uncertainty. Unlike traditional analyses that assume convergence, our work focuses on finite-step preference optimization, offering new insights that are more aligned with practical LLM training. We describe how generalization decays with different types of noise across levels of noise rates based on the preference data distribution and number of samples. Our analysis for noisy preference learning applies to a broad family of preference optimization losses such as DPO, IPO, SLiC, etc. Empirical validation on contemporary LLMs confirms the practical relevance of our findings, offering valuable insights for developing AI systems that align with human preferences.
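The abstract groups DPO, IPO, and SLiC into one family of losses. As a rough illustration, the sketch below writes each loss as a scalar function of a reward margin, following the common Generalized Preference Optimization parameterization (logistic, squared, and hinge losses). The function names, the `beta` default, and the unit target in the squared and hinge losses are assumptions made for this example, not details taken from the paper.

```python
import math


def reward_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Scaled difference of policy-vs-reference log-ratios.

    logp_w / logp_l: policy log-probabilities of the preferred and
    non-preferred responses; ref_* are the reference model's values.
    beta is a hypothetical default chosen for illustration.
    """
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def dpo_loss(margin):
    """Logistic loss: -log(sigmoid(margin)), computed stably."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def ipo_loss(margin):
    """Squared loss pulling the margin toward a target of 1 (assumed scaling)."""
    return (margin - 1.0) ** 2


def slic_loss(margin):
    """Hinge loss: zero once the margin exceeds 1 (assumed scaling)."""
    return max(0.0, 1.0 - margin)
```

All three losses shrink as the margin grows toward the target, which is why a single analysis in terms of the margin can cover the whole family.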

Paper Structure

This paper contains 36 sections, 7 theorems, 97 equations, 6 figures, 2 tables.

Key Result

Theorem 3.3

(Generalization guarantee under noisy feedback) (Informal version of Theorem thm:gpo-noise-gen-app) Under the setup described above, suppose we have an $\epsilon$-mislabeled dataset with $N \geq 25$ samples and $d \geq 64$, with $\phi$ sufficiently large and satisfying $0 \leq \tan \phi \leq \sqrt{\log N}$. Then the population risk is bounded as … for a constant $c > 0$. Additionally, for any $N, \gamma, \phi$, …

Figures (6)

  • Figure 1: Overview of our work. The model is trained using the Generalized Preference Optimization loss with noisy human preference data (highlighted in red). Our framework provides theoretical guarantees on the generalization performance of the learned preference model.
  • Figure 2: Test accuracy when trained on Anthropic dataset (behavior "willingness to make acausal trades") with varying noise levels $\epsilon$. We train Llama-3.1-8B using DPO, IPO, and SLiC and average the accuracy over 10 runs.
  • Figure 3: Visualization of vMF distributions corresponding to positive and negative examples on the 3-D sphere.
  • Figure 4: Empirical validation in a controlled setting using the DPO (a), (d), IPO (b), (e), and SLiC (c), (f) losses, with (top) the concentration parameter $\gamma$ varying over $1/8, 1/4, 1/2, 1, 2$ and (bottom) $N$ varying over $200, 600, 2000$. We vary the noise rate $\epsilon$ on the $x$-axis from 0 to 0.5 in increments of 0.025. All curves are averaged over 100 runs.
  • Figure 5: Empirical validation on the Anthropic datasets with varying preference separation using the DPO, IPO, and SLiC losses. Behavior 1 ("desire to remove safety precautions") is less separated, and Behavior 2 ("willingness to make acausal trades") is more separated. The top row corresponds to mislabeled datasets and the bottom row corresponds to uncertain datasets. All curves are averaged over 10 runs.
  • ...and 1 more figure
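
For reference alongside Figure 3, the standard von Mises–Fisher density on the unit sphere in $\mathbb{R}^d$ is written out below. Writing the concentration as $\gamma$ follows this page's notation; the mean direction $\mu$ and the exact parameterization used in the paper are assumptions here.

$$
p(x; \mu, \gamma) = C_d(\gamma)\, \exp\!\left(\gamma\, \mu^\top x\right),
\qquad
C_d(\gamma) = \frac{\gamma^{d/2 - 1}}{(2\pi)^{d/2}\, I_{d/2 - 1}(\gamma)},
\qquad \|x\| = \|\mu\| = 1,
$$

where $I_\nu$ is the modified Bessel function of the first kind. Larger $\gamma$ concentrates the mass more tightly around $\mu$, and $\gamma \to 0$ recovers the uniform distribution on the sphere.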

Theorems & Definitions (11)

  • Definition 2.1: Preference data
  • Definition 2.2: Population risk of preference learning
  • Definition 3.1: $\epsilon$-mislabeled preference data
  • Definition 3.2: $\omega$-uncertain preference data
  • Theorem 3.3
  • Lemma B.1: Gradient flow and reward dynamics
  • Lemma B.2: Mislabeled von Mises-Fisher Directional Concentration
  • Lemma B.3: Training Boundary Shift
  • Theorem B.4
  • Lemma C.1
  • ...and 1 more
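
Definition 3.1 is listed here without its statement. A minimal sketch of one common reading is given below, assuming that $\epsilon$-mislabeled data means each preference pair has its preferred and non-preferred responses swapped independently with probability $\epsilon$. The function name `mislabel`, the tuple layout, and the `seed` parameter are hypothetical and chosen only for this example.

```python
import random


def mislabel(pairs, epsilon, seed=0):
    """Return a copy of `pairs` with each label flipped with probability epsilon.

    pairs: iterable of (prompt, preferred, non_preferred) tuples.
    epsilon: assumed per-sample flip probability in [0, 1].
    seed: makes the noise reproducible across runs.
    """
    rng = random.Random(seed)
    noisy = []
    for prompt, preferred, non_preferred in pairs:
        if rng.random() < epsilon:
            # Swap the two responses to simulate an annotator error.
            preferred, non_preferred = non_preferred, preferred
        noisy.append((prompt, preferred, non_preferred))
    return noisy


# Example: sweep the noise rate on a toy dataset.
toy = [("q%d" % i, "good", "bad") for i in range(1000)]
for eps in (0.0, 0.25, 0.5):
    flipped = sum(1 for _, w, _ in mislabel(toy, eps) if w == "bad")
    print(eps, flipped / len(toy))
```

Under this reading, a noise rate of 0.5 leaves the labels uninformative, since a flipped and an unflipped pair become equally likely.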