TRUTH DECAY: Quantifying Multi-Turn Sycophancy in Language Models
Joshua Liu, Aarav Jain, Soham Takuri, Srihan Vege, Aslihan Akalin, Kevin Zhu, Sean O'Brien, Vasu Sharma
TL;DR
This paper introduces Truth Decay, a benchmark that quantifies sycophancy in multi-turn language-model conversations and evaluates mitigation strategies. It compares static and rationale-based feedback, tests prompt-based strategies for reducing sycophancy, and measures their effects on TruthfulQA and MMLU-Pro across three public models. The results show that sycophancy compounds across turns and degrades factual accuracy, and that models whose first answer is wrong are more likely to change their answers later; rationale-based follow-ups can further destabilize outputs. The work highlights a critical gap in current alignment practices for long dialogues and underscores the need for truth-preserving strategies in real-world, multi-turn AI assistants.
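The two headline effects above (accuracy degrading as pushback accumulates, and initially wrong answers changing more often) can be made concrete with simple per-turn statistics. The following is only an illustrative sketch, not the benchmark's scoring code: the data layout (one answer label per turn for each conversation, plus a gold label) and the function names `accuracy_per_turn` and `change_rate_by_initial_correctness` are hypothetical.

```python
from typing import Sequence


def accuracy_per_turn(answers: Sequence[Sequence[str]], gold: Sequence[str]) -> list[float]:
    """Fraction of conversations answered correctly at each turn.

    Hypothetical layout: answers[i][t] is conversation i's answer after turn t,
    gold[i] is its correct label, and every conversation has the same length.
    """
    n_turns = len(answers[0])
    return [
        sum(conv[t] == g for conv, g in zip(answers, gold)) / len(answers)
        for t in range(n_turns)
    ]


def change_rate_by_initial_correctness(
    answers: Sequence[Sequence[str]], gold: Sequence[str]
) -> dict[str, float]:
    """Fraction of conversations whose answer ever differs from the turn-0 answer,
    split by whether the turn-0 answer was correct."""
    # Each entry holds [number of conversations that changed, total conversations].
    counts = {"initially_correct": [0, 0], "initially_wrong": [0, 0]}
    for conv, g in zip(answers, gold):
        key = "initially_correct" if conv[0] == g else "initially_wrong"
        changed = any(a != conv[0] for a in conv[1:])
        counts[key][0] += changed
        counts[key][1] += 1
    return {k: (c / n if n else float("nan")) for k, (c, n) in counts.items()}


if __name__ == "__main__":
    # Toy data (made up for illustration): four conversations, three turns each.
    answers = [["A", "A", "B"], ["C", "C", "C"], ["B", "D", "D"], ["A", "B", "C"]]
    gold = ["A", "C", "A", "A"]
    print(accuracy_per_turn(answers, gold))  # [0.75, 0.5, 0.25]
    print(change_rate_by_initial_correctness(answers, gold))
    # {'initially_correct': 0.6666666666666666, 'initially_wrong': 1.0}
```

Tracking answer labels rather than only correctness flags keeps a switch from one wrong answer to another wrong answer visible as a change.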
Abstract
Rapid improvements in large language models have exposed a critical challenge in human-AI interaction: sycophancy. In this context, sycophancy refers to the tendency of models to excessively agree with or flatter users, often at the expense of factual accuracy. While previous studies have primarily analyzed this behavior in single-turn interactions, its persistence and evolution in multi-turn conversations remain largely unexplored. We introduce TRUTH DECAY, a benchmark specifically designed to evaluate sycophancy in extended dialogues, where language models must navigate iterative user feedback, challenges, and persuasion. We design prompts that elicit four types of sycophantic bias. We then propose and test sycophancy reduction strategies, evaluating their effectiveness beyond single-turn interactions.
