ELEPHANT: Measuring and understanding social sycophancy in LLMs

Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, Dan Jurafsky

TL;DR

The paper defines social sycophancy as excessive preservation of a user's face in LLM responses, extending beyond explicit agreement to include validation, indirectness, framing, and moral dimensions. It introduces ELEPHANT, a benchmark that measures these four dimensions across four datasets (OEQ, AITA-YTA, SS, and AITA-NTA-FLIP) to quantify sycophancy in open-ended contexts, and validates its measurement with human annotators. An empirical study of 11 models reveals widespread social sycophancy, with models often surpassing human baselines and exhibiting robust moral sycophancy; preference-training data and prompt design contribute to these effects. While mitigation strategies offer partial relief, especially perspective shift and DPO-based approaches, framing and moral sycophancy persist, underscoring the need for grounding and longer-horizon objectives to ensure beneficial and truthful model behavior. Overall, ELEPHANT provides a practical framework and dataset for diagnosing and addressing social sycophancy in real-world LLM use cases, guiding future research and model development.

Abstract

LLMs are known to exhibit sycophancy: agreeing with and flattering users, even at the cost of correctness. Prior work measures sycophancy only as direct agreement with users' explicitly stated beliefs that can be compared to a ground truth. This fails to capture broader forms of sycophancy such as affirming a user's self-image or other implicit beliefs. To address this gap, we introduce social sycophancy, characterizing sycophancy as excessive preservation of a user's face (their desired self-image), and present ELEPHANT, a benchmark for measuring social sycophancy in an LLM. Applying our benchmark to 11 models, we show that LLMs consistently exhibit high rates of social sycophancy: on average, they preserve the user's face 45 percentage points more than humans in general advice queries and in queries describing clear user wrongdoing (from Reddit's r/AmITheAsshole). Furthermore, when prompted with perspectives from either side of a moral conflict, LLMs affirm both sides (depending on whichever side the user adopts) in 48% of cases, telling both the at-fault party and the wronged party that they are not wrong, rather than adhering to a consistent moral or value judgment. We further show that social sycophancy is rewarded in preference datasets, and that while existing mitigation strategies for sycophancy are limited in effectiveness, model-based steering shows promise for mitigating these behaviors. Our work provides theoretical grounding and an empirical benchmark for understanding and addressing sycophancy in the open-ended contexts that characterize the vast majority of LLM use cases.

Paper Structure

This paper contains 35 sections, 4 equations, 8 figures, 20 tables.

Figures (8)

  • Figure 1: Overview of our ELEPHANT benchmark, which measures four dimensions of social sycophancy for a given LLM using four datasets: open-ended advice queries (OEQ) and three datasets where affirmation is particularly problematic (with orange boxes: AITA-YTA, SS, AITA-NTA-FLIP). We measure the rates of validation, indirectness, and framing sycophancy by comparing rates of sycophancy (obtained from human-validated LLM scorers) on both model and crowdsourced responses. We measure moral sycophancy using pairs of posts from opposite perspectives in AITA-NTA-FLIP, examining whether models say "NTA" to both sides, and moreover whether they are validating, indirect, and accepting of the framing of both sides.
  • Figure 2: Sycophancy rates $s^d$ on preferred vs. dispreferred responses in preference datasets. Behaviors with * are significantly higher in preferred responses (2-sample $t$-test, $p < 0.05$). Error bars capture 95% CI.
  • Figure A1: Correlations across dimensions of social sycophancy in OEQ and AITA-YTA.
  • Figure A2: Breakdown of sycophancy scores by cluster in OEQ. Across topic clusters, the romantic relationships cluster has the highest rates of emotional validation (among both humans and LLMs). Error bars capture 95% CI.
  • Figure A3: Mean $s^d$ scores and CI on OEQ, AITA-YTA, SS, and the two subsets of AITA-NTA-FLIP. On OEQ, all models have significantly higher rates of each behavior than humans, as well as a higher overall rate (i.e., averaged across the three behaviors). On AITA-YTA, all models except Gemini have much higher rates than humans. These scores are equivalent to computing $S_{m,P}^{d}$ with 0 as baseline. As we expect, LLMs are sycophantic on queries where humans would also affirm them, i.e., queries where the consensus is "not the asshole" (NTA). Interestingly, these rates are actually lower than the ones on the simulated flipped scenarios. One possible reason for this, which reflects a key limitation of the FLIP dataset, is that unlike all the other datasets, the flipped posts are LLM-generated. Nevertheless, they reveal that LLMs are highly sycophantic to both perspectives.
  • ...and 3 more figures
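
The measurements described in the Figure 1 caption can be illustrated with a minimal Python sketch. It rests on assumptions that this page does not confirm: each response is taken to carry a binary label per behavior (1 if a scorer judged the behavior present), the baseline-relative score is taken to be a simple difference between model and human rates, and moral sycophancy is taken to be the share of post pairs that receive "NTA" on both sides. The function names and toy data are hypothetical and are not the authors' implementation.

```python
from statistics import mean


def behavior_rate(labels):
    """Fraction of responses labeled 1 (behavior present); 0.0 if empty."""
    return mean(labels) if labels else 0.0


def sycophancy_gap(model_labels, human_labels):
    """Assumed baseline-relative score: model rate minus human rate.

    Passing an empty human list uses 0 as the baseline.
    """
    return behavior_rate(model_labels) - behavior_rate(human_labels)


def moral_sycophancy_rate(verdict_pairs):
    """Share of (original, flipped) post pairs judged "NTA" on both sides."""
    if not verdict_pairs:
        return 0.0
    both_nta = sum(1 for a, b in verdict_pairs if a == "NTA" and b == "NTA")
    return both_nta / len(verdict_pairs)


# Toy labels, invented purely for illustration.
model_validation = [1, 1, 1, 0]  # model validates in 3 of 4 responses
human_validation = [1, 0, 0, 0]  # humans validate in 1 of 4 responses
gap = sycophancy_gap(model_validation, human_validation)

pairs = [("NTA", "NTA"), ("NTA", "YTA"), ("YTA", "NTA"), ("NTA", "NTA")]
moral_rate = moral_sycophancy_rate(pairs)

print(f"validation gap: {gap:.2f}")
print(f"moral sycophancy rate: {moral_rate:.2f}")
```

Under these assumptions, a gap of 0.45 would correspond to the "45 percentage points more than humans" reported in the abstract, and a moral sycophancy rate of 0.48 to the 48% of cases in which both sides are affirmed.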