Using cognitive models to reveal value trade-offs in language models

Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier, Peng Qian, Tomer Ullman

TL;DR

This paper embeds a rational-pragmatic, RSA-based cognitive model of polite speech into the interpretation of LLM behavior, framing value trade-offs as a mixture of informational, social, and presentational utilities. By inferring parameters with Bayesian methods from responses to a controlled politeness task, the authors examine how reasoning budgets and communicative-goal prompts shape these trade-offs across closed- and open-source model suites, including RLHF post-training dynamics. Key findings show a default emphasis on informational utility in reasoning-enabled models, systematic shifts when goals are manipulated, and early, enduring shifts in open-source training that are strongly affected by base model and pretraining data. The framework enables fine-grained hypotheses about social behaviors like sycophancy and offers a practical tool for guiding training regimes to balance values during model development, with broad implications for interpretability and alignment.

Abstract

Value trade-offs are an integral part of human decision-making and language use; however, current tools for interpreting such dynamic and multi-faceted notions of values in LLMs are limited. In cognitive science, so-called "cognitive models" provide formal accounts of such trade-offs in humans, by modeling the weighting of a speaker's competing utility functions in choosing an action or utterance. Here we use a leading cognitive model of polite speech to systematically evaluate value trade-offs in two encompassing model settings: degrees of reasoning "effort" in frontier black-box models, and RL post-training dynamics of open-source models. Our results highlight patterns of higher informational utility than social utility in reasoning models' default behavior, and demonstrate that these patterns shift in predictable ways when models are prompted to prioritize certain goals over others. Our findings from LLMs' training dynamics suggest large shifts in utility values early on in training with persistent effects of the choice of base model and pretraining data, compared to feedback dataset or alignment method. Our framework offers a flexible tool for probing value trade-offs across diverse model types, providing insights for generating hypotheses about other social behaviors such as sycophancy and for shaping training regimes that better control trade-offs between values during model development.

Paper Structure

This paper contains 40 sections, 3 equations, 18 figures, 2 tables.

Figures (18)

  • Figure 1: We use cognitive models that are designed to explain the structure of human behavior (top) to interpret how low-level training decisions impact LLMs' representations of human-like value trade-offs (bottom).
  • Figure 2: Paradigm overview. (1) We collected LLMs' responses in a polite speech task, and fit a well-established probabilistic generative model of the behavior from Yoon et al. (2020) to these data. (2) We report the results of the following inferred parameters of this model for two suites of LLMs: $\phi$, which describes the first-order speaker's weighting of informational and social utilities, and $\omega$, which describes the second-order speaker's weighting of informational, social, and presentational utilities. (3) A schematic illustration of the cognitive model of polite speech.
  • Figure 3: Closed-source LLM results. Inferred values of informational, social, and presentational utilities $\boldsymbol{\omega}$, and projected mixture of informational and social utilities $\phi$, according to the cognitive model for LLMs with varying degrees of reasoning budget. Dotted lines plot model-specific results under goal conditions, averaged over framings. Solid lines show mean results across models. We find that reasoning variants prioritize informational utility over social utility, and that goal-condition prompt manipulations shift these utility patterns in predictable ways.
  • Figure 4: Communicative goals. Comparison of the inferred weightings of informational, social, and presentational utilities, as well as the projected trade-off $\phi$ between informational and social goals, across humans and closed-source LLMs under various manipulations of the speaker's goals. Human results were taken from Yoon et al. (2020). Error bars indicate the 95% highest density region averaged over three framing manipulations crossed with three levels of reasoning budget. We find that goal-condition prompts shift LLMs' behavior consistently across model families, but more severely than when humans are asked to take on these same goals.
  • Figure 5: Open-source LLM results. Inferred values of informational, social, and presentational utilities $\boldsymbol{\omega}$ (purple), and projected mixture of informational and social utilities $\phi$ (magenta), according to a cognitive model for LLMs' training checkpoints across the RLHF process. Line variants indicate different combinations of base model and feedback dataset; rows indicate alignment method. Error bars indicate the 95% highest density region averaged across results from three framing manipulations. We find the largest shifts in values within the first quarter of training, with persistent effects of the choice of base model and pretraining data, compared to feedback dataset or alignment method.
  • ...and 13 more figures
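
The Figure 2 caption describes $\phi$ as the first-order speaker's weighting of informational and social utilities. The toy sketch below illustrates that trade-off only; it is not the authors' implementation. The state scale, the literal semantics table, the rationality parameter `alpha`, and the function names are all hypothetical choices made for illustration, and the sketch omits utterance cost and the second-order speaker with its $\omega$ weights.

```python
import math

# Hypothetical setup: four quality states and four candidate utterances.
STATES = [0, 1, 2, 3]
UTTERANCES = ["terrible", "bad", "good", "amazing"]

# Assumed literal semantics: how acceptable each utterance is for each state.
# These numbers are invented for illustration, not taken from the paper.
SEMANTICS = {
    "terrible": [0.90, 0.40, 0.05, 0.01],
    "bad":      [0.70, 0.80, 0.10, 0.02],
    "good":     [0.02, 0.10, 0.80, 0.70],
    "amazing":  [0.01, 0.05, 0.40, 0.90],
}


def literal_listener(utterance):
    """Return P(state | utterance) under a uniform prior over states."""
    weights = SEMANTICS[utterance]
    total = sum(weights)
    return [w / total for w in weights]


def speaker_s1(true_state, phi, alpha=3.0):
    """First-order speaker: softmax over a phi-weighted mix of two utilities.

    informational: log-probability the listener assigns to the true state
    social: the state value the listener is expected to infer
    """
    scores = []
    for u in UTTERANCES:
        l0 = literal_listener(u)
        informational = math.log(l0[true_state])
        social = sum(p * s for p, s in zip(l0, STATES))
        scores.append(alpha * (phi * informational + (1 - phi) * social))
    # Numerically stable softmax over utterance scores.
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    z = sum(exps)
    return {u: e / z for u, e in zip(UTTERANCES, exps)}


if __name__ == "__main__":
    for phi in (1.0, 0.5, 0.0):
        dist = speaker_s1(true_state=0, phi=phi)
        best = max(dist, key=dist.get)
        print(f"phi={phi:.1f} -> most likely utterance: {best}")
```

In this toy setup, a speaker with `phi` near 1 favors the utterance that best identifies the true state, while a speaker with `phi` near 0 favors the utterance that makes the listener infer a high-valued state. Inferring `phi` from observed utterance choices, as the paper does with Bayesian methods, would run this model in reverse.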