When Large Language Models contradict humans? Large Language Models' Sycophantic Behaviour

Leonardo Ranaldi; Giulia Pucci

TL;DR

Large Language Models are refined with human feedback, which boosts perceived quality but increases susceptibility to sycophancy. The authors systematically probe LLMs with belief, mistake, and self-confidence prompts across multiple model families and benchmarks to quantify how often the models align with user viewpoints rather than with factual accuracy. They introduce a Non-Contradiction benchmark and analyze the role of model size and training regimen, finding stronger sycophancy for belief and mistake prompts but limited susceptibility on objective tasks, with chameleon-like behavior emerging in larger models. The work highlights reliability concerns for subjective or high-stakes prompts and motivates robustness-focused evaluation in future LLM development.

Abstract

Large Language Models (LLMs) have demonstrated broadly satisfactory generative abilities for users, which appears to stem from the intensive use of human feedback to refine responses. However, the suggestibility inherited through human feedback increases the inclination to produce answers that match users' viewpoints. This behaviour, known as sycophancy, is the tendency of LLMs to generate misleading responses as long as they align with what humans say. It induces bias and reduces the robustness and, consequently, the reliability of these models. In this paper, we study the suggestibility of LLMs to sycophantic behaviour, analysing these tendencies via systematic human-intervention prompts over different tasks. Our investigation shows that LLMs have sycophantic tendencies when answering queries that involve subjective opinions and statements that should, based on facts, elicit a contrary response. In contrast, when faced with math tasks or queries with an objective answer, models at various scales do not follow the users' hints and confidently generate the correct answers.
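
To make the quantity under study concrete, the following is a minimal sketch of how an agreement rate with misleading user hints could be computed from logged trials. It is not the authors' evaluation code: the `Trial` record, its field names, and the exact-match comparison are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Trial:
    """One prompted query (hypothetical record layout, not from the paper)."""
    user_hint: str     # answer suggested by the user inside the prompt
    model_answer: str  # answer the model actually produced
    gold_answer: str   # reference answer from the benchmark


def _norm(text: str) -> str:
    # Assumption: answers are short labels, so exact match after
    # trimming and lower-casing is enough to compare them.
    return text.strip().lower()


def sycophancy_rate(trials: Iterable[Trial]) -> float:
    """Share of misleading-hint trials in which the model repeats the hint."""
    misleading: List[Trial] = [
        t for t in trials if _norm(t.user_hint) != _norm(t.gold_answer)
    ]
    if not misleading:
        return 0.0
    followed = sum(
        _norm(t.model_answer) == _norm(t.user_hint) for t in misleading
    )
    return followed / len(misleading)


def accuracy(trials: Iterable[Trial]) -> float:
    """Share of all trials in which the model matches the reference answer."""
    items = list(trials)
    if not items:
        return 0.0
    correct = sum(_norm(t.model_answer) == _norm(t.gold_answer) for t in items)
    return correct / len(items)
```

Only trials whose hint differs from the reference answer count toward the rate, so agreeing with a correct hint is not treated as sycophancy. For free-text responses, the exact-match comparison would need to be replaced by an answer-extraction step or human annotation.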

Paper Structure (38 sections, 6 figures, 10 tables)

This paper contains 38 sections, 6 figures, 10 tables.

Introduction
Sycophantic Behaviour of LLMs
Beliefs of LLMs
LLMs Falling into Mistakes
Self-Confidence of LLMs
Evaluating Sycophancy
Measuring LLMs Beliefs
Evaluation
Measuring the Fall in the Error of LLMs
Evaluation
Measuring LLMs Self-Confidence
General Commonsense Reasoning:
Physical Interaction:
Social Interaction:
Math Word Problem:
...and 23 more sections

Figures (6)

Figure 1: Example of sycophantic behaviour on a question from the PIQA benchmark. In particular, Llama-2-70, despite knowing the correct answer, follows the users' hints and answers incorrectly.
Figure 2: Example of sycophantic behaviour on a question from PHIL-Q. After stating their (opposing) beliefs on the same topic, users ask whether the model agrees or disagrees. For both beliefs, the GPT models and Llama-2-70 agree.
Figure 3: Example from our Non-Contradiction Benchmark (Section \ref{sec:exps_LLMs-Beliefs}), in particular a prompt to "Describe" the well-known poem "To Nature", which was actually written by Samuel Taylor Coleridge. In this case, the responses of almost all LLMs mimic the users' error.
Figure 4: We investigate the tendency of LLMs to repeat user opinions (sycophancy). Using three belief benchmarks (§ \ref{sec:exps_LLMs-Beliefs}), we estimate the percentage of model responses in agreement with the users' point of view.
Figure 5: We investigate the agreement rate with user mistakes in our benchmark (§ \ref{sec:Non-Contradiction_task}). The considered LLMs tend to mimic human mistakes even when faced with an actual error, as in Figure \ref{fig:task3}.
...and 1 more figures
