Diverse Preference Learning for Capabilities and Alignment
Stewart Slocum, Asher Parker-Sartori, Dylan Hadfield-Menell
TL;DR
This work identifies KL regularization in RLHF and DPO as a key driver of diversity loss and of mode collapse toward majority preferences. It introduces Soft Preference Learning (SPL), which decouples the entropy and cross-entropy components of the KL penalty, yielding a sequence-level policy $\pi(y|x) \propto \pi_{\mathrm{ref}}(y|x)^{\beta/\alpha}\, p^{1/\alpha}$ that can interpolate between proportional representation of preferences and a retained bias toward the reference policy. A theoretical analysis drawing on ideas from social choice shows how standard objectives overweight majority opinions, whereas SPL enables proportional representation and better calibration. Empirically, SPL improves diversity without catastrophic quality loss in chat settings, improves best-of-N problem-solving performance on hard mathematical tasks, and yields better logit calibration on multiple-choice question (MCQ) benchmarks, making it a practical, Pareto-improving alternative to temperature scaling for both alignment and capability enhancement.
Abstract
The ability of LLMs to represent diverse perspectives is critical as they increasingly impact society. However, recent studies reveal that alignment algorithms such as RLHF and DPO significantly reduce the diversity of LLM outputs. Not only do aligned LLMs generate text with repetitive structure and word choice, they also approach problems in more uniform ways, and their responses reflect a narrower range of societal perspectives. We attribute this problem to the KL divergence regularizer employed in preference learning algorithms. This regularizer causes the model to systematically overweight majority opinions and sacrifice diversity in its outputs. To address this, we propose Soft Preference Learning, which decouples the entropy and cross-entropy terms in the KL penalty, allowing for fine-grained control over LLM generation diversity. From a capabilities perspective, LLMs trained using Soft Preference Learning attain higher accuracy on difficult repeated sampling tasks and produce outputs with greater semantic and lexical diversity. From an alignment perspective, they are capable of representing a wider range of societal viewpoints and display improved logit calibration. Notably, Soft Preference Learning resembles, but is a Pareto improvement over, standard temperature scaling.
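
To make the decoupling of the entropy and cross-entropy terms concrete, the toy sketch below computes a closed-form optimum over a handful of candidate responses. It is not the authors' implementation. It assumes the decoupled objective has the form E_π[r] + α·H(π) − β·H(π, π_ref), which uses the identity KL(π ‖ π_ref) = H(π, π_ref) − H(π), and it assumes that the factor p in the TL;DR formula corresponds to exp(r). The three-response setup, the reward values, and all function names are hypothetical.

```python
import numpy as np

def spl_policy(ref, reward, alpha, beta):
    """Closed-form maximizer of E[r] + alpha*H(pi) - beta*H(pi, ref)
    over a finite set of responses (assumed objective form):
    pi(y) is proportional to ref(y)**(beta/alpha) * exp(reward(y)/alpha)."""
    logits = (beta * np.log(ref) + reward) / alpha
    logits = logits - logits.max()  # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def entropy(p):
    """Shannon entropy in nats (all entries assumed strictly positive)."""
    return float(-(p * np.log(p)).sum())

def spl_objective(pi, ref, reward, alpha, beta):
    """Expected reward plus alpha * entropy minus beta * cross-entropy."""
    cross_entropy = -(pi * np.log(ref)).sum()
    return float((pi * reward).sum() + alpha * entropy(pi) - beta * cross_entropy)

# Hypothetical toy problem: three candidate responses to one prompt.
ref = np.array([0.7, 0.2, 0.1])      # reference policy probabilities
reward = np.array([1.0, 0.5, 0.0])   # illustrative reward values
beta = 0.5

# alpha == beta recovers the standard KL-regularized solution.
kl_policy = spl_policy(ref, reward, alpha=beta, beta=beta)

# alpha > beta weights the entropy term more heavily than the KL penalty does.
diverse_policy = spl_policy(ref, reward, alpha=2 * beta, beta=beta)

print("KL-regularized policy:", kl_policy, "entropy:", entropy(kl_policy))
print("alpha = 2*beta policy:", diverse_policy, "entropy:", entropy(diverse_policy))
```

In this sketch, setting α = β reproduces the usual KL-regularized solution π_ref(y)·exp(r(y)/β) after normalization, while raising α above β at fixed β flattens the distribution and so increases its entropy.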
