Dissecting Human and LLM Preferences

Junlong Li; Fan Zhou; Shichao Sun; Yikai Zhang; Hai Zhao; Pengfei Liu

TL;DR

This work introduces Preference Dissection, a framework that decomposes the preferences of humans and 32 LLMs into a quantitative mix of 29 clearly defined properties, estimated via Bayesian logistic regression on scenario-balanced, real-world conversations. Responses are annotated with basic, query-specific, and error-detection properties. Each response pair is then turned into a comparison feature vector φ, and the fitted weights α quantify how much each property contributes to the overall preference. The study finds that humans are more tolerant of errors and dislike responses in which models admit their limits, whereas advanced LLMs such as GPT-4-Turbo prioritize correctness, clarity, and harmlessness; LLMs of similar size also tend to show similar preferences. The work further demonstrates that preference-based evaluation can be intentionally manipulated: in both training-free and training-based settings, adapting a model to a judge's preferences shifts its benchmark scores. This underlines the need for robust evaluation protocols, and the data and tools are publicly released for replication and further research. Together, the realistic dataset, the property taxonomy, and the Bayesian inference offer actionable insights for LLM alignment and evaluation, with practical implications for mitigating reward hacking and improving reliability.
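To make the role of φ and α concrete, the following is a minimal sketch of the underlying model, not the authors' implementation. It assumes P(response A preferred) = sigmoid(φ · α), encodes each comparison feature as -1, 0, or +1 (an illustrative encoding), and uses a MAP point estimate under a Gaussian prior as a simple stand-in for full Bayesian inference. The data are synthetic and the function name `fit_map_weights` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_map_weights(phi, y, prior_std=1.0, lr=0.5, steps=2000):
    """MAP estimate of alpha for P(y=1 | phi) = sigmoid(phi @ alpha).

    A zero-mean Gaussian prior with standard deviation `prior_std` is placed
    on every weight. This is a point estimate, not a full posterior.
    """
    n, d = phi.shape
    alpha = np.zeros(d)
    for _ in range(steps):
        p = sigmoid(phi @ alpha)
        # Gradient of the (averaged) log-posterior: likelihood term + prior term.
        grad = phi.T @ (y - p) / n - alpha / (n * prior_std ** 2)
        alpha += lr * grad
    return alpha

# Synthetic toy data (assumption): 3 properties, 2000 response pairs.
# phi[i, k] = +1 if response A satisfies property k better than B,
#             -1 if B does, 0 if they tie.
rng = np.random.default_rng(0)
true_alpha = np.array([1.5, -1.0, 0.0])  # liked, disliked, irrelevant property
phi = rng.choice([-1.0, 0.0, 1.0], size=(2000, 3))
y = (rng.random(2000) < sigmoid(phi @ true_alpha)).astype(float)

alpha_hat = fit_map_weights(phi, y)
print(np.round(alpha_hat, 2))
```

In this toy setting, a positive weight marks a property that raises the chance of a response being preferred, a negative weight marks a disliked property, and a weight near zero marks one the judge largely ignores.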

Abstract

As a relative quality comparison of model responses, human and Large Language Model (LLM) preferences serve as common alignment goals in model fine-tuning and as criteria in evaluation. Yet these preferences merely reflect broad tendencies, resulting in less explainable and controllable models with potential safety risks. In this work, we dissect the preferences of humans and 32 different LLMs to understand their quantitative composition, using annotations from real-world user-model conversations for a fine-grained, scenario-wise analysis. We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits. In contrast, advanced LLMs like GPT-4-Turbo place more emphasis on correctness, clarity, and harmlessness. Additionally, LLMs of similar sizes tend to exhibit similar preferences regardless of their training methods, and fine-tuning for alignment does not significantly alter the preferences of pretrained-only LLMs. Finally, we show that preference-based evaluation can be intentionally manipulated. In both training-free and training-based settings, aligning a model with the preferences of judges boosts scores, while injecting the least preferred properties lowers them. This results in notable score shifts: up to 0.59 on MT-Bench (1-10 scale) and 31.94 on AlpacaEval 2.0 (0-100 scale), highlighting the significant impact of this strategic adaptation.

Interactive Demo: https://huggingface.co/spaces/GAIR/Preference-Dissection-Visualization
Dataset: https://huggingface.co/datasets/GAIR/preference-dissection
Code: https://github.com/GAIR-NLP/Preference-Dissection
