Can DPO Learn Diverse Human Values? A Theoretical Scaling Law

Shawn Im, Sharon Li

TL;DR

This work examines how diversity in human values affects generalization in preference learning for LLMs trained with Direct Preference Optimization (DPO). It introduces a theoretical framework that models preferences as a mixture of value-cluster distributions and analyzes finite-step training via reward-margin dynamics, deriving bounds and a scaling law: the number of samples required per value grows as $\Theta(\log K)$ in the number of distinct values $K$. The results connect training dynamics to generalization performance and are supported by empirical validation across contemporary LLMs and preference datasets, with practical implications for data collection and model alignment in pluralistic settings. The framework extends to multi-token generation and generalizes to other preference objectives, offering a principled lens on failure modes and future directions for robust, value-diverse alignment.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities but often struggle to align with human preferences, leading to harmful or undesirable outputs. Preference learning, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for ensuring that LLMs align with human values. An essential part of ensuring that LLMs are aligned for all people is accounting for a diverse set of values. This paper introduces a new theoretical framework to analyze how generalization scales with value diversity and sample quantity in models trained with direct preference optimization. Our framework rigorously assesses how well models generalize after a finite number of gradient steps, reflecting real-world LLM training practices. By analyzing the reward margin associated with each sample and its trajectory throughout training, we provide a bound on the generalization error that demonstrates the challenges of effectively learning a wide set of concepts or values. These insights are empirically validated on contemporary LLMs, underscoring the practical relevance of our theory.
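
As a concrete reading of the "reward margin" mentioned in the abstract, the sketch below computes it for a single preference pair under the standard DPO formulation, where the implicit reward of a response is $\beta$ times its log-probability ratio against a frozen reference model. The function names, argument names, and the default `beta=0.1` are illustrative assumptions, not taken from the paper.

```python
import math


def reward_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Implicit DPO reward margin for one preference pair.

    logp_w / logp_l: policy log-probabilities of the preferred / non-preferred response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    beta: DPO temperature (illustrative default).
    """
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def dpo_loss(margin):
    """Per-sample DPO loss, -log(sigmoid(margin)), computed stably."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference has zero margin.
m0 = reward_margin(-1.0, -2.0, -1.0, -2.0)
# A policy that raised the preferred response and lowered the other has a positive margin.
m1 = reward_margin(-0.5, -3.0, -1.0, -2.0)
print(m0, dpo_loss(m0))
print(m1, dpo_loss(m1))
```

A positive margin means the model ranks the preferred response above the non-preferred one relative to the reference, so tracking the sign and size of this quantity per sample is one way to follow training and test behavior over gradient steps.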

Paper Structure (52 sections, 12 theorems, 114 equations, 11 figures, 2 tables)

Key Result

Lemma 4.1

Suppose $g: \mathcal{V}^T \to \mathbb{R}^d$ is the non-linear mapping from the prompt to the last hidden state, which is connected to the model output $f_\theta(x)$ via the learnable unembedding layer matrix $W$. The dynamics for the reward margin under the gradient flow of the weight matrix can [...] where $r_i$ is the shorthand notation for the reward margin of sample $x_i$, $\tau$ is an inverse learn...
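
The setting of Lemma 4.1 (fixed last-hidden-state features with a learnable unembedding matrix $W$) can be explored numerically. The sketch below is a discrete-time gradient-descent simulation, not the lemma's gradient-flow equation, and it assumes single-token responses so that the softmax normalizer cancels in the margin. The function `train_last_layer`, the parameters `beta`, `lr`, and `steps`, and the toy features are all hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_last_layer(G, y_w, y_l, vocab_size, beta=0.1, lr=1.0, steps=200):
    """Track DPO reward margins while training only the unembedding matrix.

    G: (N, d) array of fixed features g(x_i) (assumed given and frozen).
    y_w, y_l: (N,) integer token ids of preferred / non-preferred responses.
    Returns an array of shape (steps + 1, N) holding the margins r_i per step.
    """
    n, d = G.shape
    # D[i] = one_hot(y_w[i]) - one_hot(y_l[i])
    D = np.zeros((n, vocab_size))
    D[np.arange(n), y_w] += 1.0
    D[np.arange(n), y_l] -= 1.0

    # Only W - W_ref matters for the margin, so start from zero.
    delta_W = np.zeros((vocab_size, d))
    history = []
    for _ in range(steps):
        # r_i = beta * D_i^T (W - W_ref) g_i
        margins = beta * np.einsum("iv,vd,id->i", D, delta_W, G)
        history.append(margins)
        # Gradient of mean(-log sigmoid(r_i)) with respect to W.
        weights = sigmoid(-margins)
        grad = -(beta / n) * (D * weights[:, None]).T @ G
        delta_W -= lr * grad
    history.append(beta * np.einsum("iv,vd,id->i", D, delta_W, G))
    return np.array(history)


# Toy example: four samples with random features, two opposing preferences.
rng = np.random.default_rng(0)
G = rng.normal(size=(4, 16))
history = train_last_layer(G, np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0]), vocab_size=2)
print(history[0], history[-1])
```

In this sketch each update to a sample's margin is a sum over all training samples, weighted by the inner products of their features and of their preference directions, which is why correlated or conflicting samples can speed up or slow down one another's margins.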

Figures (11)

  • Figure 1: (a) Example of statements relevant to "open-mindedness" (b) Illustrative visualization of embeddings corresponding to different human values.
  • Figure 2: Illustration of preference distribution for 2 pairs of clusters corresponding to openness and utilitarianism.
  • Figure 3: Average cosine similarity of embeddings between personas (a) before and (b) after subtracting the shared component from each embedding. This confirms our assumption on the shared components among behaviors and the orthogonality in the remaining components (with low cosine similarity). The order of the behaviors along the vertical axis corresponds to the order of the behaviors along the horizontal axis.
  • Figure 4: Scaling curve for the generalization error to be $<0.05$.
  • Figure 5: Average reward margins for the training/test set over the course of last-layer training (a, b) and full fine-tuning (c, d) across increasing number of human values $K$.
  • ...and 6 more figures

Theorems & Definitions (14)

  • Definition 3.1: Population Risk of Preference Learning
  • Lemma 4.1
  • Theorem 4.2: Training Reward Guarantee
  • Theorem 4.3: Generalization Error
  • Lemma B.1
  • Lemma B.2
  • Theorem B.3
  • Theorem B.4
  • Definition C.1: $\delta$-approximately orthogonal clusters
  • Lemma C.2: Pairwise approximate orthogonality
  • ...and 4 more