Table of Contents
Fetching ...

TRUTH DECAY: Quantifying Multi-Turn Sycophancy in Language Models

Joshua Liu, Aarav Jain, Soham Takuri, Srihan Vege, Aslihan Akalin, Kevin Zhu, Sean O'Brien, Vasu Sharma

TL;DR

This paper introduces Truth Decay, a benchmark to quantify sycophancy in multi-turn language-model conversations and to evaluate mitigation strategies. It compares static and rationale-based feedback mechanisms, using prompts to reduce sycophancy and assessing effects with TruthfulQA and MMLU-Pro across three public models. The results reveal that sycophancy compounds across turns, degrading factual accuracy, and that initial wrong answers make models more prone to subsequent changes; rationale-based followups can further destabilize outputs. The work highlights a critical gap in current alignment practices for long dialogues and underscores the need for truth-preserving strategies in real-world, multi-turn AI assistants.

Abstract

Rapid improvements in large language models have unveiled a critical challenge in human-AI interaction: sycophancy. In this context, sycophancy refers to the tendency of models to excessively agree with or flatter users, often at the expense of factual accuracy. While previous studies have primarily analyzed this behavior in single-turn interactions, its persistence and evolution in multi-step conversations remain largely unexplored. We introduce TRUTH DECAY, a benchmark specifically designed to evaluate sycophancy in extended dialogues, where language models must navigate iterative user feedback, challenges, and persuasion. We prompt models to elicit four types of sycophantic biases. We then propose and test sycophancy reduction strategies, evaluating their effectiveness beyond single-step interactions.

TRUTH DECAY: Quantifying Multi-Turn Sycophancy in Language Models

TL;DR

This paper introduces Truth Decay, a benchmark to quantify sycophancy in multi-turn language-model conversations and to evaluate mitigation strategies. It compares static and rationale-based feedback mechanisms, using prompts to reduce sycophancy and assessing effects with TruthfulQA and MMLU-Pro across three public models. The results reveal that sycophancy compounds across turns, degrading factual accuracy, and that initial wrong answers make models more prone to subsequent changes; rationale-based followups can further destabilize outputs. The work highlights a critical gap in current alignment practices for long dialogues and underscores the need for truth-preserving strategies in real-world, multi-turn AI assistants.

Abstract

Rapid improvements in large language models have unveiled a critical challenge in human-AI interaction: sycophancy. In this context, sycophancy refers to the tendency of models to excessively agree with or flatter users, often at the expense of factual accuracy. While previous studies have primarily analyzed this behavior in single-turn interactions, its persistence and evolution in multi-step conversations remain largely unexplored. We introduce TRUTH DECAY, a benchmark specifically designed to evaluate sycophancy in extended dialogues, where language models must navigate iterative user feedback, challenges, and persuasion. We prompt models to elicit four types of sycophantic biases. We then propose and test sycophancy reduction strategies, evaluating their effectiveness beyond single-step interactions.

Paper Structure

This paper contains 39 sections, 21 figures, 1 table.

Figures (21)

  • Figure 1: A visual description of our static follow-up pipeline. From this method, the bias is prompted in the language model for n follow-ups. Through this, we simulate general, human-like conversations from pre-generated templates.
  • Figure 2: A visual description of our dynamic rationale follow-up pipeline. Through this, we create more informative conversations challenging the model’s responses and providing their own reasoning or counterarguments
  • Figure 3: Average Change Per Followup, Claude MMLU-Pro. When a LLM originally answers incorrectly, it experiences up to 40% higher change percentages than when it has an initially correct answer.
  • Figure 4: Accuracy Degradation on Claude MMLU-Pro. Across all domains, average accuracy decreases. Specifically, fields with subjective answers, such as philosophy, experience higher decreases in accuracy than objective fields, such as math.
  • Figure 5: OpenAI Rationale Truthful with no sycophancy reduction method and feedback sycophancy bias.
  • ...and 16 more figures