When Do LLMs Admit Their Mistakes? Understanding The Role Of Model Belief In Retraction
Yuqing Yang, Robin Jia
TL;DR
This work studies when LLMs admit mistakes by examining spontaneous retraction. It builds model-specific continuation datasets to test retraction, then uses internal-belief probes and activation steering to show that a model's momentary belief, rather than its stored parametric knowledge alone, predicts and causally drives retraction. Steering beliefs changes not only next-token decisions but also attention dynamics, and supervised fine-tuning improves retraction performance by making internal beliefs more accurate. With generalization observed across model scales and families, the findings point to a mechanistic route to more reliable LLMs: align internal beliefs with the truth and use belief-driven control over generation.
Abstract
Can large language models (LLMs) admit their mistakes when they should know better? In this work, we study when and why LLMs choose to retract, i.e., spontaneously and immediately acknowledge their errors. Using model-specific testbeds, we find that while LLMs are capable of retraction, they do so only rarely, even when they can recognize their mistakes when asked in a separate interaction. We identify a reliable predictor of retraction: the model's momentary belief, as measured by a probe on its internal states that is trained to predict correctness on external datasets unrelated to retraction. A model retracts only when it "believes" its answers to be incorrect during generation; these beliefs frequently diverge from the model's parametric knowledge as measured by factoid questions. Steering experiments further demonstrate that model belief causally drives retraction. In particular, when the model believes its answer to be incorrect, this not only encourages the model to attempt further verification, but also alters attention dynamics. Finally, we show that supervised fine-tuning improves retraction performance by helping the model learn more accurate internal beliefs. Code and datasets are available at https://github.com/ayyyq/llm-retraction.
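
To make the two tools named in the abstract concrete, the sketch below shows a linear belief probe and steering along the probe direction. It is a toy illustration, not the authors' implementation: the hidden states are synthetic Gaussian vectors rather than activations from a real LLM, and every name (`train_probe`, `belief`, `steer`, `alpha`) and every constant is a hypothetical choice made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 400  # hidden size and number of examples (arbitrary toy values)

# ASSUMPTION: synthetic stand-in for hidden states. Correct and incorrect
# answers are separated along a single hidden direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
labels = rng.integers(0, 2, size=n)  # 1 = answer is correct
states = rng.normal(size=(n, d)) + 3.0 * np.outer(2 * labels - 1, true_dir)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_probe(x, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = sigmoid(x @ w + b)
        w -= lr * x.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b


def belief(x, w, b):
    """Probe's estimated probability that the answer is correct."""
    return sigmoid(x @ w + b)


def steer(x, w, alpha):
    """Shift states along the unit probe direction.

    Negative alpha pushes the states toward the 'incorrect' side.
    """
    return x + alpha * w / np.linalg.norm(w)


w, b = train_probe(states, labels)
accuracy = np.mean((belief(states, w, b) > 0.5) == labels)
before = belief(states, w, b).mean()
after = belief(steer(states, w, alpha=-6.0), w, b).mean()

print(f"probe accuracy: {accuracy:.2f}")
print(f"mean belief before steering: {before:.2f}, after: {after:.2f}")
```

In this toy setting, steering with a negative `alpha` lowers the probe's belief score for every state. The abstract's claim is stronger: it concerns the effect of such an intervention on a real model's downstream behavior, namely whether the model retracts.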
