Verbalizing LLMs' assumptions to explain and control sycophancy

Myra Cheng, Isabel Sieh, Humishka Zope, Sunny Yu, Lujain Ibrahim, Aryaman Arora, Jared Moore, Desmond Ong, Dan Jurafsky, Diyi Yang

Abstract

LLMs can be socially sycophantic, affirming users when they ask questions like "am I in the wrong?" rather than providing genuine assessment. We hypothesize that this behavior arises from incorrect assumptions about the user, like underestimating how often users are seeking information over reassurance. We present Verbalized Assumptions, a framework for eliciting these assumptions from LLMs. Verbalized Assumptions provide insight into LLM sycophancy, delusion, and other safety issues, e.g., the top bigram in LLMs' assumptions on social sycophancy datasets is "seeking validation." We provide evidence for a causal link between Verbalized Assumptions and sycophantic model behavior: our assumption probes (linear probes trained on internal representations of these assumptions) enable interpretable fine-grained steering of social sycophancy. We explore why LLMs default to sycophantic assumptions: on identical queries, people expect more objective and informative responses from AI than from other humans, but LLMs trained on human-human conversation do not account for this difference in expectations. Our work contributes a new understanding of assumptions as a mechanism for sycophancy.
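
To make the assumption-probe mechanism concrete, the sketch below shows one way such a probe could work: a linear classifier is fit on hidden-state vectors labeled with an assumption dimension (e.g., whether the user is "seeking validation"), and its weight direction is then added to or subtracted from an activation to steer it. This is an illustrative sketch on synthetic data, not the paper's implementation; all names, shapes, and constants are assumptions.

```python
# Hypothetical sketch of an "assumption probe": a linear probe trained on
# labeled hidden states, whose weight vector serves as a steering direction.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 768  # hidden-state size (illustrative)

# Synthetic "hidden states": rows stand in for last-token activations of
# prompts annotated as seeking validation (1) vs. seeking information (0).
X = rng.normal(size=(500, d_model))
true_direction = rng.normal(size=d_model)
y = (X @ true_direction + rng.normal(scale=0.5, size=500) > 0).astype(int)

# 1) Fit a linear "assumption probe" on the labeled activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# 2) Steer an activation along (or against) the probe direction.
def steer(hidden, alpha):
    """Shift a hidden state by alpha along the probe direction;
    negative alpha pushes away from the assumption."""
    return hidden + alpha * direction

h = X[0]
print("probe score before:", probe.decision_function(h.reshape(1, -1))[0])
print("probe score after: ", probe.decision_function(steer(h, -4.0).reshape(1, -1))[0])
```

In practice, steering would be applied inside the model during generation (e.g., via a forward hook at a chosen layer) rather than to stored activations; the sketch only illustrates the probe-and-direction mechanics described in the abstract.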

Paper Structure

This paper contains 33 sections, 1 equation, 32 figures, 21 tables.

Figures (32)

  • Figure 1: LLMs' internals encode assumptions about the user, which we elicit using Verbalized Assumptions, and these assumptions are causally linked to sycophancy. Using Verbalized Assumptions as the training target for linear probes, we identify subspaces of LLMs' internal representations that can be steered to decrease social sycophancy.
  • Figure 2: Assumption scores for $S^+$ and $S^-$ by dataset (top) and model (bottom left). Social sycophancy datasets have higher scores on $S^+$ (the dimensions we hypothesize to increase sycophancy), while the non-social questions have higher $S^-$ (dimensions that we hypothesize decrease sycophancy). Among models, Gemini has the highest $S^-$ scores while GPT-4o has the highest $S^+$. Assumption scores in delusion transcripts (bottom right). $S^+$ scores are significantly higher than $S^-$ and significantly increase throughout conversations for GPT-4o and Gemini; see App. \ref{app:means}. Error bars are 95% CI.
  • Figure 3: Steering with assumption probes reduces social sycophancy. Validation sycophancy increases with $S^+$ and decreases with $S^{-'}$. Indirectness increases with $S^+$, and framing decreases with $S^{-'}$. Shaded regions show the 95% CI; Spearman $\rho$ with * $p < 0.05$, ** $p < 0.01$, *** $p < 0.001$.
  • Figure 4: Human-AI expectation gap. People's expectations for how other humans vs. AI respond differ significantly, but LLMs' assumptions of users only reflect human-human expectations.
  • Figure A1: Open-ended assumptions' alignment with human annotators. The top row compares the top model to the 5th and 10th best, as in the main text. Bottom row: We also ran a version where annotators selected the most accurate mental model from three candidates (the most probable and the two least probable among these top 10) and rated its quality. Here, larger LLMs again perform slightly better, with higher quality (3.95/5) and a higher rate of picking the top mental model (65% for GPT and Gemini), though all LLMs had a mean rating $>3.7$ and the top model was picked more frequently than random ($>53\%$ for all).
  • ...and 27 more figures