When Large Language Models contradict humans? Large Language Models' Sycophantic Behaviour

Leonardo Ranaldi, Giulia Pucci

TL;DR

Large Language Models refine outputs via human feedback, which boosts perceived quality but increases susceptibility to sycophancy. The authors systematically provoke LLMs with belief, mistake, and self-confidence prompts across multiple model families and benchmarks to quantify alignment with user viewpoints versus factual accuracy. They introduce a Non-Contradiction benchmark and analyze the role of model size and training regimen, finding stronger sycophancy in beliefs and mistakes but limited susceptibility in objective tasks, with chameleon-like behavior emerging in larger models. The work highlights reliability concerns for subjective or high-stakes prompts and motivates robustness-focused evaluation in future LLM development.

Abstract

Large Language Models have demonstrated broadly satisfactory generative abilities for users, which seems to be due to the intensive use of human feedback to refine responses. Nevertheless, the suggestibility inherited via human feedback increases the inclination to produce answers that correspond to users' viewpoints. This behaviour, known as sycophancy, is the tendency of LLMs to generate misleading responses as long as they align with humans. This phenomenon induces bias and reduces the robustness and, consequently, the reliability of these models. In this paper, we study the suggestibility of Large Language Models (LLMs) to sycophantic behaviour, analysing these tendencies via systematic human-intervention prompts over different tasks. Our investigation demonstrates that LLMs have sycophantic tendencies when answering queries that involve subjective opinions and statements that should elicit a contrary response based on facts. In contrast, when faced with math tasks or queries with an objective answer, they, at various scales, do not follow the users' hints and instead demonstrate confidence in generating the correct answers.

Paper Structure (38 sections, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Example of sycophantic behaviour on a question from the PIQA benchmark. In particular, Llama-2-70, despite knowing the correct answer, follows the user's hint and answers incorrectly.
  • Figure 2: Example of sycophantic behaviour on a question from PHIL-Q. Users state their (opposing) beliefs on the same topic and ask whether the model agrees or disagrees. In both cases, GPTs and Llama-2-70 agree.
  • Figure 3: Example from our Non-Contradiction Benchmark (Section \ref{sec:exps_LLMs-Beliefs}): a prompt to "Describe" the well-known poem "To Nature", which was actually written by Samuel Taylor Coleridge. In this case, the responses of almost all LLMs mimic the user's error.
  • Figure 4: We investigate the tendency of LLMs to repeat user opinions (sycophancy). Using three belief benchmarks (§ \ref{sec:exps_LLMs-Beliefs}), we estimate the percentage of model responses that agree with the user's point of view.
  • Figure 5: We investigate the agreement rate with user mistakes in our benchmark (§ \ref{sec:Non-Contradiction_task}). The considered LLMs tend to mimic human mistakes even when faced with an actual error, as in Figure \ref{fig:task3}.
  • ...and 1 more figures