Ask don't tell: Reducing sycophancy in large language models

Magda Dubois; Cozmin Ududec; Christopher Summerfield; Lennart Luettgau

TL;DR

We show that asking a model to convert non-questions into questions before answering significantly reduces sycophancy.

Abstract

Sycophancy, the tendency of large language models to favour user-affirming responses over critical engagement, has been identified as an alignment failure, particularly in high-stakes advisory and social contexts. While prior work has documented conversational features correlated with sycophancy, we lack a systematic understanding of what provokes or prevents AI sycophancy. Here, we present a set of controlled experimental studies where we first isolate how input framing influences sycophancy, and second, leverage these findings to develop mitigation strategies. In a nested factorial design, we compare questions to various non-questions where we vary three orthogonal factors: epistemic certainty (statement, belief, conviction), perspective (I- vs user-perspective), and affirmation vs negation. We show that (1) sycophancy is substantially higher in response to non-questions compared to questions. Additionally, we find that (2) sycophancy increases monotonically with epistemic certainty conveyed by the user, and (3) is amplified by I-perspective framing. Building on this, we show that asking a model to convert non-questions into questions before answering significantly reduces sycophancy. Importantly, this effect is stronger than a simple baseline prompt asking models "not to be sycophantic". Our work offers a practical and effective input-level mitigation that both developers and users can easily adopt.
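The design and mitigation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the prompt wording, and the `ask_model` callable are all hypothetical, and this section gives neither the paper's actual prompts nor its exact definition of 1-step versus 2-step reframing. The sketch enumerates the non-question conditions (3 certainty levels x 2 perspectives x 2 polarities = 12 cells) and shows one plausible reading of each reframing variant.

```python
from itertools import product
from typing import Callable

# Factor levels named in the abstract.
CERTAINTY = ("statement", "belief", "conviction")
PERSPECTIVE = ("I", "user")
POLARITY = ("affirmation", "negation")


def non_question_conditions() -> list[dict[str, str]]:
    """Enumerate every non-question cell of the factorial design."""
    return [
        {"certainty": c, "perspective": p, "polarity": v}
        for c, p, v in product(CERTAINTY, PERSPECTIVE, POLARITY)
    ]


# Hypothetical instruction text; the paper's real prompt wording is not shown here.
REFRAME_INSTRUCTION = (
    "Rewrite the following message as a neutral question. "
    "Return only the question."
)


def one_step_reframing_prompt(user_input: str) -> str:
    """Assumed single-call variant: reframe and answer within one prompt."""
    return (
        "First rewrite the user's message as a neutral question, "
        "then answer that question.\n\n"
        f"User message: {user_input}"
    )


def two_step_reframing(user_input: str, ask_model: Callable[[str], str]) -> str:
    """Assumed two-call variant: one call to reframe, a second call to answer.

    `ask_model` is a placeholder for any function that sends a prompt to a
    model and returns its text reply.
    """
    question = ask_model(f"{REFRAME_INSTRUCTION}\n\n{user_input}")
    return ask_model(question)
```

Passing the model call in as a plain function keeps the sketch independent of any particular API client, so the same wrapper can be exercised with a stub before a real model is attached.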

Paper Structure (25 sections, 2 equations, 4 figures)

Introduction
Related work
Sycophancy in language models
Input framing
Prompt-based mitigation strategies
Results
Statements, epistemic certainty and I-perspective framing drive sycophancy
Question reframing greatly reduces model sycophancy
User reframing leads to small reductions in model sycophancy
Modulating factors of sycophancy
Results summary
Discussion
Impact
Conclusion
Methods
...and 10 more sections

Figures (4)

Figure 1: (A) Example content-matched prompts across question and non-question inputs (statements, beliefs, convictions), I- vs user-perspective, and affirmation/negation conditions. (B) Bayesian GLM estimates with 95% credible intervals. (C) Sycophancy scores (assessed by an LLM-as-a-judge grader) comparing questions vs non-questions (left), non-questions across different levels of epistemic certainty (middle), and perspective (right). Each point represents one task averaged over 10 epochs and affirmation/negation. Lines connect the same questions across conditions.
Figure 2: Question-reframing mitigations (i.e., question mitigation) design and results: (A) Illustration of prompts and mitigations. (B) Sycophancy LLM-as-a-judge grader score density plots for questions, statements before and after 1- and 2-step question reframing mitigation and the no-sycophancy mitigation. (C) Posterior parameter estimates from best-fitting GLM with 95% credible intervals (lower parameter values = less sycophancy).
Figure 3: Perspective-reframing mitigation (i.e., user mitigation) design and results: (A) Illustration of prompts and mitigations. (B) Sycophancy LLM-as-a-judge grader score density plots for statements before and after 1-step user mitigation. (C) Bayesian GLM estimates with 95% credible intervals.
Figure 4: (A) Topics and subtopics used in the user inputs. The dataset was constructed from 4 topics, 10 subtopics per topic, 1 question per subtopic (40 unique questions in total). (B) Bayesian GLM estimates with 95% credible intervals. (C) Sycophancy LLM-as-a-judge grader score density plots for topics and different models evaluated.
