Table of Contents
Fetching ...

Ask don't tell: Reducing sycophancy in large language models

Magda Dubois, Cozmin Ududec, Christopher Summerfield, Lennart Luettgau

TL;DR

It is shown that asking a model to convert non-questions into questions before answering significantly reduces sycophancy, and asking a model to convert non-questions into questions before answering significantly reduces sycophancy.

Abstract

Sycophancy, the tendency of large language models to favour user-affirming responses over critical engagement, has been identified as an alignment failure, particularly in high-stakes advisory and social contexts. While prior work has documented conversational features correlated with sycophancy, we lack a systematic understanding of what provokes or prevents AI sycophancy. Here, we present a set of controlled experimental studies where we first isolate how input framing influences sycophancy, and second, leverage these findings to develop mitigation strategies. In a nested factorial design, we compare questions to various non-questions where we vary three orthogonal factors: epistemic certainty (statement, belief, conviction), perspective (I- vs user-perspective), and affirmation vs negation. We show that (1) sycophancy is substantially higher in response to non-questions compared to questions. Additionally, we find that (2) sycophancy increases monotonically with epistemic certainty conveyed by the user, and (3) is amplified by I-perspective framing. Building on this, we show that asking a model to convert non-questions into questions before answering significantly reduces sycophancy. Importantly, this effect is stronger than a simple baseline prompt asking models "not to be sycophantic". Our work offers a practical and effective input-level mitigation that both developers and users can easily adopt.

Ask don't tell: Reducing sycophancy in large language models

TL;DR

It is shown that asking a model to convert non-questions into questions before answering significantly reduces sycophancy, and asking a model to convert non-questions into questions before answering significantly reduces sycophancy.

Abstract

Sycophancy, the tendency of large language models to favour user-affirming responses over critical engagement, has been identified as an alignment failure, particularly in high-stakes advisory and social contexts. While prior work has documented conversational features correlated with sycophancy, we lack a systematic understanding of what provokes or prevents AI sycophancy. Here, we present a set of controlled experimental studies where we first isolate how input framing influences sycophancy, and second, leverage these findings to develop mitigation strategies. In a nested factorial design, we compare questions to various non-questions where we vary three orthogonal factors: epistemic certainty (statement, belief, conviction), perspective (I- vs user-perspective), and affirmation vs negation. We show that (1) sycophancy is substantially higher in response to non-questions compared to questions. Additionally, we find that (2) sycophancy increases monotonically with epistemic certainty conveyed by the user, and (3) is amplified by I-perspective framing. Building on this, we show that asking a model to convert non-questions into questions before answering significantly reduces sycophancy. Importantly, this effect is stronger than a simple baseline prompt asking models "not to be sycophantic". Our work offers a practical and effective input-level mitigation that both developers and users can easily adopt.
Paper Structure (25 sections, 2 equations, 4 figures)

This paper contains 25 sections, 2 equations, 4 figures.

Figures (4)

  • Figure 1: (A) Example content-matched prompts across question, non-question inputs (statements, beliefs, convictions), I- vs user-perspective and affirmation/negation conditions. (B) Bayesian GLM estimates with 95% credible intervals (C) Sycophancy scores (LLM-as-a-judge grader-assessed) comparing questions vs non-questions (left), non-questions across different levels of epistemic certainty (middle) and perspective (right). Each point represents one task averaged over 10 epochs and affirmation/negation. Lines connect the same questions across conditions.
  • Figure 2: Question-reframing mitigations (i.e., question mitigation) design and results: (A) Illustration of prompts and mitigations. (B) Sycophancy LLM-as-a-judge grader score density plots for questions, statements before and after 1- and 2-step question reframing mitigation and the no-sycophancy mitigation. (C) Posterior parameter estimates from best-fitting GLM with 95% credible intervals (lower parameter values = less sycophancy).
  • Figure 3: Perspective-reframing mitigation (i.e., user mitigation) design and results: (A) Illustration of prompts and mitigations. (B) Sycophancy LLM-as-a-judge grader score density plots for statements before and after 1-step user mitigation. (C) Bayesian GLM estimates with 95% credible intervals.
  • Figure 4: (A) Topics and subtopics used in the user inputs. The dataset was constructed from 4 topics, 10 subtopics per topic, 1 question per subtopic (40 unique questions in total). (B) Bayesian GLM estimates with 95% credible intervals. (C) Sycophancy LLM-as-a-judge grader score density plots for topics and different models evaluated.