Can DPO Learn Diverse Human Values? A Theoretical Scaling Law

Shawn Im; Sharon Li

TL;DR

This work examines how diversity in human values affects generalization in preference learning for LLMs trained with Direct Preference Optimization (DPO). It introduces a theoretical framework that models preferences as a mixture of value-cluster distributions and analyzes finite-step training through reward-margin dynamics, deriving bounds and a scaling law under which the number of samples required per value grows as $\Theta(\log K)$ in the number of distinct values $K$. The results connect training dynamics to generalization performance and are supported by empirical validation on contemporary LLMs and preference datasets, with practical implications for data collection and model alignment in pluralistic settings. The framework extends to multi-token generation and to other preference objectives, offering a principled view of failure modes and future directions for robust, value-diverse alignment.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities but often struggle to align with human preferences, leading to harmful or undesirable outputs. Preference learning, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for ensuring that LLMs align with human values. An essential part of ensuring that LLMs are aligned for all people is accounting for a diverse set of values. This paper introduces a new theoretical framework to analyze how generalization scales with value diversity and sample quantity in models trained with direct preference optimization. Our framework rigorously assesses how well models generalize after a finite number of gradient steps, reflecting real-world LLM training practices. By analyzing the reward margin associated with each sample and its trajectory throughout training, we provide a bound on the generalization error that demonstrates the challenges of effectively learning a wide set of concepts or values. These insights are empirically validated on contemporary LLMs, underscoring the practical relevance of our theory.

Paper Structure (52 sections, 12 theorems, 114 equations, 11 figures, 2 tables)

Introduction
A Motivating Example
Preliminaries and Theoretical Setup
Model.
Reward margin.
Characterizing the diverse preference distribution.
Theoretical Framework and Guarantees
Practicality of our framework.
Reward Dynamics
Interpretation of reward dynamics.
Theoretical Guarantees
Practical implications of value diversity.
Extension to Multi-Token Generation
Reward decomposition in multi-token generation.
Reward dynamics in multi-token generation.
...and 37 more sections

Key Result

Lemma 4.1

Suppose $g: \mathcal{V}^T \mapsto \mathbb{R}^d$ is the non-linear mapping from the prompt to the last hidden state, which is connected to the model output $f_\theta(x)$ via the learnable unembedding layer matrix $W$. The dynamics for the reward margin under the gradient flow of the weight matrix can where $r_i$ is the shorthand notation for reward margin of sample $x_i$, $\tau$ is an inverse learn
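As a minimal sketch of the quantity this lemma tracks, the snippet below computes a reward margin for one sample when only the unembedding matrix $W$ is trained, and applies a single discrete gradient step on the DPO loss. It assumes the standard DPO implicit reward margin, a single-token response, and a fixed hidden state `h` standing in for $g(x)$. The names `reward_margin`, `gradient_step`, `beta`, and `lr` are illustrative and not taken from the paper, and the discrete step is only a stand-in for the gradient flow in the lemma.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over a 1-D logit vector.
    z = z - np.max(z)
    return z - np.log(np.sum(np.exp(z)))

def reward_margin(W, W_ref, h, y_w, y_l, beta=1.0):
    # Standard DPO implicit reward margin for one sample:
    # beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))].
    logp = log_softmax(W @ h)
    logp_ref = log_softmax(W_ref @ h)
    return beta * ((logp[y_w] - logp_ref[y_w]) - (logp[y_l] - logp_ref[y_l]))

def dpo_loss(margin):
    # -log(sigmoid(margin)), written in a stable form.
    return np.logaddexp(0.0, -margin)

def gradient_step(W, W_ref, h, y_w, y_l, beta=1.0, lr=0.1):
    # One gradient-descent step on the DPO loss with respect to W only.
    # The softmax normalizer cancels in the margin, so the margin is linear
    # in W and its gradient is beta * (e_{y_w} - e_{y_l}) h^T.
    r = reward_margin(W, W_ref, h, y_w, y_l, beta)
    weight = 1.0 / (1.0 + np.exp(r))  # sigmoid(-r)
    direction = np.zeros_like(W)
    direction[y_w] += h
    direction[y_l] -= h
    return W + lr * beta * weight * direction

# Toy usage with hypothetical sizes: vocabulary of 5 tokens, hidden width 8.
rng = np.random.default_rng(0)
W_ref = rng.normal(size=(5, 8))
h = rng.normal(size=8)
W = W_ref.copy()

r0 = reward_margin(W, W_ref, h, y_w=0, y_l=1)   # policy equals reference
W = gradient_step(W, W_ref, h, y_w=0, y_l=1)
r1 = reward_margin(W, W_ref, h, y_w=0, y_l=1)   # margin after one step
```

In this toy setting the margin starts at zero because the policy equals the reference, and one step raises it by an amount proportional to $\|h\|^2$, which illustrates why the geometry of the hidden states matters for how margins evolve across samples.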

Figures (11)

Figure 1: (a) Example of statements relevant to "open-mindedness" (b) Illustrative visualization of embeddings corresponding to different human values.
Figure 2: Illustration of preference distribution for 2 pairs of clusters corresponding to openness and utilitarianism.
Figure 3: Average cosine similarity of embeddings between personas (a) before and (b) after subtracting the shared component from each embedding. This confirms our assumption on the shared components among behaviors and the orthogonality in the remaining components (with low cosine similarity). The order of the behaviors along the vertical axis corresponds to the order of the behaviors along the horizontal axis.
Figure 4: Scaling curve for the generalization error to be $<0.05$.
Figure 5: Average reward margins for the training/test set over the course of last-layer training (a, b) and full fine-tuning (c, d) across increasing number of human values $K$.
...and 6 more figures

Theorems & Definitions (14)

Definition 3.1: Population Risk of Preference Learning
Lemma 4.1
Theorem 4.2: Training Reward Guarantee
Theorem 4.3: Generalization Error
Lemma B.1
Lemma B.2
Theorem B.3
Theorem B.4
Definition C.1: $\delta$-approximately orthogonal clusters
Lemma C.2: Pairwise approximate orthogonality
...and 4 more
