Using cognitive models to reveal value trade-offs in language models
Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier, Peng Qian, Tomer Ullman
TL;DR
This paper uses a rational-pragmatic cognitive model of polite speech, based on the Rational Speech Act (RSA) framework, to interpret LLM behavior, framing value trade-offs as a weighted mixture of informational, social, and presentational utilities. By fitting the model's parameters with Bayesian inference to responses on a controlled politeness task, the authors examine how reasoning budgets and communicative-goal prompts shape these trade-offs in closed-source models, and how the trade-offs evolve during RL post-training of open-source models. Key findings are a default emphasis on informational over social utility in reasoning-enabled models, systematic shifts when goals are manipulated, and large early shifts during open-source training, with lasting effects of the base model and pretraining data. The framework supports fine-grained hypotheses about social behaviors such as sycophancy and offers a practical tool for designing training regimes that balance values during model development, with implications for interpretability and alignment.
Abstract
Value trade-offs are an integral part of human decision-making and language use; however, current tools for interpreting such dynamic and multi-faceted notions of values in LLMs are limited. In cognitive science, so-called "cognitive models" provide formal accounts of such trade-offs in humans by modeling how a speaker weighs competing utility functions when choosing an action or utterance. Here we use a leading cognitive model of polite speech to systematically evaluate value trade-offs in two encompassing model settings: degrees of reasoning "effort" in frontier black-box models, and RL post-training dynamics of open-source models. Our results highlight patterns of higher informational utility than social utility in reasoning models' default behavior, and demonstrate that these patterns shift in predictable ways when models are prompted to prioritize certain goals over others. Our findings from LLMs' training dynamics suggest that utility values shift substantially early in training, and that the choice of base model and pretraining data has more persistent effects than the feedback dataset or alignment method. Our framework offers a flexible tool for probing value trade-offs across diverse model types, providing insights for generating hypotheses about other social behaviors such as sycophancy and for shaping training regimes that better control trade-offs between values during model development.
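To make the idea of weighting a speaker's competing utilities concrete, the following is a minimal illustrative sketch and not the paper's model. It assumes a speaker who picks an utterance with probability proportional to the exponential of a weighted sum of utility components. The function name speaker_distribution, the utterances, the utility numbers, and the weights are all hypothetical, and only two of the utility components (informational and social) are shown.

```python
import math


def speaker_distribution(utilities, weights, alpha=1.0):
    """Softmax over utterances of a weighted sum of utility components.

    utilities: {utterance: {component: value}} (hypothetical values)
    weights:   {component: weight}, the trade-off being illustrated
    alpha:     rationality (inverse temperature) parameter
    """
    scores = {
        utt: alpha * sum(weights[k] * comps[k] for k in weights)
        for utt, comps in utilities.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores.values())
    exp_scores = {utt: math.exp(s - max_score) for utt, s in scores.items()}
    total = sum(exp_scores.values())
    return {utt: v / total for utt, v in exp_scores.items()}


# Made-up utilities for describing a mediocre performance to its author.
UTILITIES = {
    "it was terrible": {"informational": -0.5, "social": -2.0},
    "it was okay": {"informational": -1.0, "social": 0.0},
    "it was amazing": {"informational": -3.0, "social": 1.0},
}

# A speaker who mostly values being informative.
informative = speaker_distribution(
    UTILITIES, {"informational": 0.9, "social": 0.1}
)
# A speaker who mostly values making the listener feel good.
kind = speaker_distribution(
    UTILITIES, {"informational": 0.1, "social": 0.9}
)

print(max(informative, key=informative.get))
print(max(kind, key=kind.get))
```

Under these made-up numbers, changing only the weights changes which utterance is most probable. Inferring such weights from observed responses is the kind of question the framework described above is meant to answer, although the actual model and inference procedure are specified in the paper rather than here.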
