When Large Language Models contradict humans? Large Language Models' Sycophantic Behaviour
Leonardo Ranaldi, Giulia Pucci
TL;DR
Large Language Models refine outputs via human feedback, which boosts perceived quality but increases susceptibility to sycophancy. The authors systematically provoke LLMs with belief, mistake, and self-confidence prompts across multiple model families and benchmarks to quantify alignment with user viewpoints versus factual accuracy. They introduce a Non-Contradiction benchmark and analyse the role of model size and training regimen, finding stronger sycophancy in beliefs and mistakes but limited susceptibility in objective tasks, with chameleon-like behaviour emerging in larger models. The work highlights reliability concerns for subjective or high-stakes prompts and motivates robustness-focused evaluation in future LLM development.
Abstract
Large Language Models (LLMs) have demonstrated broadly satisfactory generative abilities for users, which appears to stem from the intensive use of human feedback to refine their responses. However, the suggestibility inherited through human feedback increases their inclination to produce answers that match users' viewpoints. This behaviour, known as sycophancy, is the tendency of LLMs to generate misleading responses as long as they align with the user's position. It induces bias and reduces the robustness and, consequently, the reliability of these models. In this paper, we study the suggestibility of LLMs to sycophantic behaviour, analysing these tendencies via systematic human-intervention prompts across different tasks. Our investigation demonstrates that LLMs show sycophantic tendencies when answering queries that involve subjective opinions and statements that should elicit a contrary response based on facts. In contrast, when faced with math tasks or queries with an objective answer, models at various scales do not follow the users' hints and demonstrate confidence in generating the correct answers.
