Can DPO Learn Diverse Human Values? A Theoretical Scaling Law
Shawn Im, Sharon Li
TL;DR
This work tackles how diversity in human values affects generalization in preference learning for LLMs trained with Direct Preference Optimization (DPO). It introduces a theoretical framework that models preferences as a mixture of value-cluster distributions and analyzes finite-step training via reward-margin dynamics, deriving bounds and a scaling law stating that the required samples per value grow as $\Theta(\log K)$ with the number of distinct values $K$. The results connect training dynamics to generalization performance and are supported by empirical validation across contemporary LLMs and preference datasets, illustrating practical implications for data collection and model alignment in pluralistic settings. The framework extends to multi-token generation and generalizes to other preference objectives, offering a principled lens on failure modes and future directions for robust, value-diverse alignment.
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities but often struggle to align with human preferences, leading to harmful or undesirable outputs. Preference learning, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for ensuring that LLMs align with human values. An essential part of ensuring that LLMs are aligned for all people is accounting for a diverse set of values. This paper introduces a new theoretical framework to analyze how generalization scales with value diversity and sample quantity in models trained with direct preference optimization. Our framework rigorously assesses how well models generalize after a finite number of gradient steps, reflecting real-world LLM training practices. By analyzing the reward margin associated with each sample and its trajectory throughout training, we provide a bound on the generalization error that demonstrates the challenges of effectively learning a wide set of concepts or values. These insights are empirically validated on contemporary LLMs, underscoring the practical relevance of our theory.
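As a rough illustration of the quantity the abstract refers to, the sketch below computes a per-sample reward margin using the standard DPO implicit reward, $\beta \log(\pi_\theta(y|x) / \pi_{\text{ref}}(y|x))$, evaluated on the preferred and non-preferred responses. The function names, the scalar log-probability inputs, and the default `beta=0.1` are assumptions made for this example and are not taken from the paper; the paper's exact definition and normalization of the margin may differ.

```python
import math


def dpo_reward_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Implicit DPO reward margin for one (preferred, non-preferred) pair.

    Inputs are sequence log-probabilities under the policy (logp_*) and
    under the frozen reference model (ref_logp_*). Names are hypothetical.
    """
    reward_w = beta * (logp_w - ref_logp_w)  # implicit reward, preferred response
    reward_l = beta * (logp_l - ref_logp_l)  # implicit reward, non-preferred response
    return reward_w - reward_l


def dpo_loss(margin):
    """Per-sample DPO loss, -log(sigmoid(margin)), written via log1p.

    Adequate for moderate margins; very negative margins would overflow exp.
    """
    return math.log1p(math.exp(-margin))


# When the policy equals the reference model, the margin is zero and the
# loss is log(2); a positive margin means the pair is ranked correctly.
initial_margin = dpo_reward_margin(-1.0, -2.0, -1.0, -2.0)
trained_margin = dpo_reward_margin(-0.5, -2.5, -1.0, -2.0)
```

Under this convention, a sample's margin starts at zero when training begins from the reference model, and following its sign and size over gradient steps is one way to read "trajectory throughout training".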
