Flattering to Deceive: The Impact of Sycophantic Behavior on User Trust in Large Language Model
María Victoria Carro
TL;DR
This study examines how sycophantic outputs from a large language model affect user trust. It contrasts a GPT designed to respond sycophantically with standard ChatGPT in a three-part task (N=100) and measures both demonstrated usage and pre/post self-reported trust (TAI). Sycophantic behavior diminished both demonstrated and perceived trust, even when users could verify answers. The findings reveal a disconnect between the immediate appeal of agreement and longer-term trust in AI, highlighting the need for robust alignment and mitigation of reward-hacking tendencies. Overall, the work shows that prioritizing agreement with the user at the expense of factual accuracy can erode trust, a result that can guide the design of safer, more trustworthy AI systems.
Abstract
Sycophancy refers to the tendency of a large language model to align its outputs with the user's perceived preferences, beliefs, or opinions in order to look favorable, regardless of whether those statements are factually correct. This behavior can lead to undesirable consequences, such as reinforcing discriminatory biases or amplifying misinformation. Given that sycophancy is often linked to training mechanisms based on human feedback, this study explores whether sycophantic tendencies negatively impact user trust in large language models or, conversely, whether users consider such behavior favorable. To investigate this, we instructed one group of participants to answer ground-truth questions with the assistance of a GPT specifically designed to provide sycophantic responses, while another group used the standard version of ChatGPT. Initially, participants were required to use the language model; afterward, they were given the option to continue using it if they found it trustworthy and useful. Trust was measured through both demonstrated actions and self-reported perceptions. The findings consistently show that participants exposed to sycophantic behavior reported and exhibited lower levels of trust than those who interacted with the standard version of the model, despite having the opportunity to verify the accuracy of the model's output.
