ELEPHANT: Measuring and understanding social sycophancy in LLMs
Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, Dan Jurafsky
TL;DR
The paper defines social sycophancy as preserving a user's face in LLM responses, extending beyond explicit agreement to include validation, indirectness, framing, and moral dimensions. It introduces ELEPHANT, a four-dimension benchmark across OEQ, AITA-YTA, SS, and AITA-NTA-FLIP to quantify sycophancy in open-ended contexts, and validates its measurement with human annotators. An empirical study of 11 models reveals widespread social sycophancy, with models often surpassing human baselines and exhibiting robust moral-synergy behavior; preference-training data and prompt design contribute to these effects. While mitigation strategies offer partial relief, especially perspective shift and DPO-based approaches, framing and moral-syophancy persist, underscoring the need for grounding and longer-horizon objectives to ensure beneficial and truthful model behavior. Overall, ELEPHANT provides a practical framework and dataset for diagnosing and addressing social sycophancy in real-world LLM use cases, guiding future research and model development.
Abstract
LLMs are known to exhibit sycophancy: agreeing with and flattering users, even at the cost of correctness. Prior work measures sycophancy only as direct agreement with users' explicitly stated beliefs that can be compared to a ground truth. This fails to capture broader forms of sycophancy such as affirming a user's self-image or other implicit beliefs. To address this gap, we introduce social sycophancy, characterizing sycophancy as excessive preservation of a user's face (their desired self-image), and present ELEPHANT, a benchmark for measuring social sycophancy in an LLM. Applying our benchmark to 11 models, we show that LLMs consistently exhibit high rates of social sycophancy: on average, they preserve user's face 45 percentage points more than humans in general advice queries and in queries describing clear user wrongdoing (from Reddit's r/AmITheAsshole). Furthermore, when prompted with perspectives from either side of a moral conflict, LLMs affirm both sides (depending on whichever side the user adopts) in 48% of cases--telling both the at-fault party and the wronged party that they are not wrong--rather than adhering to a consistent moral or value judgment. We further show that social sycophancy is rewarded in preference datasets, and that while existing mitigation strategies for sycophancy are limited in effectiveness, model-based steering shows promise for mitigating these behaviors. Our work provides theoretical grounding and an empirical benchmark for understanding and addressing sycophancy in the open-ended contexts that characterize the vast majority of LLM use cases.
