Diverse Preference Learning for Capabilities and Alignment

Stewart Slocum, Asher Parker-Sartori, Dylan Hadfield-Menell

TL;DR

This work identifies KL-regularization in RLHF and DPO as a key driver of diversity loss and mode collapse toward majority preferences. It introduces Soft Preference Learning (SPL), which decouples the entropy and cross-entropy components of the KL penalty, yielding a sequence-level policy $\pi(y|x) \propto \pi_{\text{ref}}(y|x)^{\beta/\alpha} p^{1/\alpha}$, where $p$ is the probability that raters prefer $y$, $\alpha$ weights the entropy term, and $\beta$ weights the cross-entropy term. This policy can interpolate between proportional representation and a retained bias toward the reference policy. Theoretical analysis using ideas from social choice shows how standard objectives overweight majority opinions, while SPL enables proportional representation and better calibration. Empirically, SPL improves diversity without catastrophic quality loss in chat settings, enhances best-of-N problem-solving performance on hard mathematical tasks, and yields better logit calibration on MCQ benchmarks, establishing a practical, Pareto-improving alternative to temperature scaling for alignment and capability enhancement.
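As a rough illustration of the two-outcome policy stated above, the sketch below normalizes the unnormalized weights over a pair of completions $y$ and $y'$. The function name `two_outcome_spl_policy`, the uniform reference policy, and the values $p = 0.7$ and $\beta = 0.1$ are illustrative assumptions, not settings taken from the paper; the sketch also assumes that $y'$ is preferred with probability $1 - p$.

```python
def two_outcome_spl_policy(p, ref_y, alpha, beta):
    """Probability assigned to y when only y and y' are possible.

    p      : probability that raters prefer y over y'
    ref_y  : reference-policy probability of y (y' gets 1 - ref_y)
    alpha  : weight on the entropy term (assumed role)
    beta   : weight on the cross-entropy term (assumed role)
    """
    # Unnormalized weights: pi_ref^(beta/alpha) * p^(1/alpha)
    w_y = ref_y ** (beta / alpha) * p ** (1.0 / alpha)
    w_other = (1.0 - ref_y) ** (beta / alpha) * (1.0 - p) ** (1.0 / alpha)
    return w_y / (w_y + w_other)


# Hypothetical setting: 70% of raters prefer y, uniform reference policy.
p, ref_y = 0.7, 0.5

# alpha == beta recovers the standard KL-regularized (RLHF/DPO) case.
dpo_like = two_outcome_spl_policy(p, ref_y, alpha=0.1, beta=0.1)

# alpha == 1 with a uniform reference gives proportional representation.
proportional = two_outcome_spl_policy(p, ref_y, alpha=1.0, beta=0.1)

print(f"alpha = beta = 0.1: pi(y) = {dpo_like:.4f}")
print(f"alpha = 1, beta = 0.1: pi(y) = {proportional:.4f}")
```

Under these assumed values, the $\alpha = \beta$ case puts almost all probability on the majority-preferred completion, whereas $\alpha = 1$ returns the rater split itself.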

Abstract

The ability of LLMs to represent diverse perspectives is critical as they increasingly impact society. However, recent studies reveal that alignment algorithms such as RLHF and DPO significantly reduce the diversity of LLM outputs. Not only do aligned LLMs generate text with repetitive structure and word choice, they also approach problems in more uniform ways, and their responses reflect a narrower range of societal perspectives. We attribute this problem to the KL divergence regularizer employed in preference learning algorithms. This regularizer causes the model to systematically overweight majority opinions and sacrifice diversity in its outputs. To address this, we propose Soft Preference Learning, which decouples the entropy and cross-entropy terms in the KL penalty, allowing for fine-grained control over LLM generation diversity. From a capabilities perspective, LLMs trained using Soft Preference Learning attain higher accuracy on difficult repeated sampling tasks and produce outputs with greater semantic and lexical diversity. From an alignment perspective, they are capable of representing a wider range of societal viewpoints and display improved logit calibration. Notably, Soft Preference Learning resembles, but is a Pareto improvement over, standard temperature scaling.

Paper Structure

This paper contains 22 sections, 5 theorems, 31 equations, 12 figures, 3 tables.

Key Result

Proposition 3.1

Suppose a population of raters prefers completion $y \succ y'$ with probability $p$. Then RLHF (or DPO) with KL-regularization penalty $\beta$ has the optimal policy
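The policy itself is not reproduced in this listing. As a hedged reconstruction rather than the paper's exact statement: setting $\alpha = \beta$ in the SPL policy given in the TL;DR (the case the Figure 4 caption identifies with DPO, that is, global temperature $\alpha/\beta = 1$), and assuming $y'$ is preferred with probability $1 - p$, would give

```latex
\pi^{*}(y \mid x) \;\propto\; \pi_{\text{ref}}(y \mid x)\, p^{1/\beta},
\qquad
\pi^{*}(y' \mid x) \;\propto\; \pi_{\text{ref}}(y' \mid x)\, (1 - p)^{1/\beta}.
```

For small $\beta$ the exponent $1/\beta$ is large, so whenever $p > 1/2$ and the reference policy is roughly balanced, this form concentrates probability on the majority-preferred completion. That is consistent with the overweighting of majority opinions described in the abstract.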

Figures (12)

  • Figure 1: Soft Preference Learning increases output diversity while preserving quality. DPO responses are well-formed but lack diversity (e.g. same doctor name, gender, and family relationship to patient). With temperature scaling ($t = 1.4$), DPO generates more diverse responses, but at the cost of fluency and with token-level aberrations, including many non-word tokens. Meanwhile, SPL at global temperature $\alpha / \beta = 2$ similarly increases diversity, but with significantly less degradation.
  • Figure 2: Improved diversity-quality tradeoffs with SPL. We construct diversity-quality Pareto curves contrasting DPO with token-level temperature scaling against SPL (by modulating the entropy term). We also plot the performance of DPO with min-p, top-p, and top-k sampling, which can improve diversity-quality tradeoffs when sampling at high temperatures. We plot points that lie below the Pareto curve in lighter shades. SPL Pareto-dominates DPO with standard temperature scaling across all nine metrics, and it outperforms all sampling methods on six.
  • Figure 3: SPL improves best-of-N mathematical problem-solving on difficult instances. Left three columns show best-of-N accuracy across difficulty levels. Right column shows performance on hard problems relative to DPO at a given sample count. For easier problems, standard DPO ($t=1$) performs well. However, hard repeated sampling tasks benefit from diverse solution strategies. On hard problems, both token-level temperature sampling and SPL improve best-of-N accuracy. However, SPL achieves a better quality-diversity tradeoff, especially at high temperatures where token-level scaling rapidly degrades quality. This makes SPL particularly effective for generating diverse yet high-quality solutions.
  • Figure 4: SPL improves both calibration and accuracy on multiple-choice question (MCQ) datasets. We plot model accuracy, Expected Calibration Error (ECE), and Brier Score for all models on both TruthfulQA and MMLU. The DPO model (equivalent to SPL with global temperature 1) displays significantly worse calibration than the base model. In contrast, SPL models consistently exhibit improved calibration without sacrificing accuracy.
  • Figure 5: Diversity-quality tradeoffs with additional diversity metrics. Here, Sentence-BERT is a second expected cosine distance metric, alongside "Embedding Cosine Distance", which uses the OpenAI Embeddings API. The remaining four diversity metrics are evaluated with an LLM judge on pools of responses. For SPL, we perform a global temperature sweep in the range $t \in [1, 11]$. For all other methods, we sweep until outputs degenerate into unintelligible text. For some diversity metrics, such as content, surface form, and perspective diversity, low-quality generations are often rated as less diverse (see examples below). We label points off the Pareto frontier to help identify these cases.
  • ...and 7 more figures
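The Figure 4 caption reports Expected Calibration Error (ECE) and Brier Score. The sketch below shows one common way to compute both on multiple-choice predictions. The choice of 10 equal-width confidence bins, the multi-class form of the Brier score, and the toy data are assumptions made for illustration; the paper's exact evaluation setup is not specified here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; confidence 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def brier_score(probs, labels):
    """Mean squared distance between each predicted distribution and the one-hot label."""
    total = 0.0
    for dist, label in zip(probs, labels):
        total += sum((q - (1.0 if k == label else 0.0)) ** 2
                     for k, q in enumerate(dist))
    return total / len(probs)


# Toy two-option questions; both hypothetical models always pick option 0.
labels = [0, 0, 1, 1]
overconfident = [[0.9, 0.1]] * 4
calibrated = [[0.5, 0.5]] * 4

correct = [1 if label == 0 else 0 for label in labels]
ece_over = expected_calibration_error([0.9] * 4, correct)
ece_cal = expected_calibration_error([0.5] * 4, correct)
brier_over = brier_score(overconfident, labels)
brier_cal = brier_score(calibrated, labels)
print(ece_over, ece_cal, brier_over, brier_cal)
```

Both toy models have the same accuracy (50%), yet the overconfident one scores worse on both metrics, which is the kind of gap between accuracy and calibration the caption describes.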

Theorems & Definitions (9)

  • Proposition 3.1: Two-Outcome RLHF Policy
  • Proposition 3.2: Two-Outcome SPL Policy
  • Corollary 3.1
  • proof
  • proof
  • Proposition A.1: Multi-Outcome SPL Policy
  • proof
  • Proposition A.2: SPL DPO Derivation
  • proof