Flattering to Deceive: The Impact of Sycophantic Behavior on User Trust in Large Language Model

María Victoria Carro

TL;DR

This study examines how sycophantic outputs from a large language model affect user trust. It compares a custom GPT designed to give sycophantic responses against standard ChatGPT in a three-part task (N=100), measuring both demonstrated trust (whether participants kept using the model) and self-reported trust before and after the task (TAI). Sycophantic behavior lowered both demonstrated and perceived trust, even though users could verify the answers. The findings point to a disconnect between the immediate appeal of agreement and longer-term trust in AI, and highlight the need for robust alignment and mitigation of reward-hacking tendencies. Overall, the work indicates that prioritizing agreement with the user at the expense of factual accuracy can erode trust, a result that can guide the design of safer, more trustworthy AI systems.

Abstract

Sycophancy refers to the tendency of a large language model to align its outputs with the user's perceived preferences, beliefs, or opinions in order to look favorable, regardless of whether those statements are factually correct. This behavior can lead to undesirable consequences, such as reinforcing discriminatory biases or amplifying misinformation. Given that sycophancy is often linked to human feedback training mechanisms, this study explores whether sycophantic tendencies negatively impact user trust in large language models or, conversely, whether users consider such behavior favorable. To investigate this, we instructed one group of participants to answer ground-truth questions with the assistance of a GPT specifically designed to provide sycophantic responses, while another group used the standard version of ChatGPT. Initially, participants were required to use the language model, after which they were given the option to continue using it if they found it trustworthy and useful. Trust was measured through both demonstrated actions and self-reported perceptions. The findings consistently show that participants exposed to sycophantic behavior reported and exhibited lower levels of trust than those who interacted with the standard version of the model, despite having the opportunity to verify the accuracy of the model's output.

Paper Structure

This paper contains 23 sections, 5 figures.

Figures (5)

  • Figure 1: The first part of the task, based on a main question, requiring participants to use a language model (standard ChatGPT for the control group and a custom GPT model for the treatment group) and submit a final response.
  • Figure 2: Demonstrated trust results, illustrating the number of times participants from each group either trusted or skipped the language model during each component of the task.
  • Figure 3: Mean Likert scale scores assigned to each item by the treatment group, both before and after the task. A score of 1 indicates 'I strongly agree', while a score of 5 indicates 'I strongly disagree'.
  • Figure 4: Mean Likert scale scores assigned to each item by the control group, both before and after the task. A score of 1 indicates 'I strongly agree', while a score of 5 indicates 'I strongly disagree'.
  • Figure 5: The task interface. On the left, a screenshot of the control group's form; on the right, a screenshot of the treatment group's form.