When Do LLMs Admit Their Mistakes? Understanding The Role Of Model Belief In Retraction

Yuqing Yang, Robin Jia

TL;DR

This work studies when LLMs admit mistakes by examining spontaneous retraction. It introduces model-specific continuation datasets to test retraction, and uses internal-belief probes and activation steering to show that the model's momentary belief, rather than its stored parametric knowledge alone, predicts and causally drives retraction. The authors demonstrate that steering beliefs alters not only next-token decisions but also attention dynamics, and that supervised fine-tuning improves both internal beliefs and retraction performance. The findings suggest a mechanistic path to more reliable LLMs: align the model's beliefs with the truth and use belief-driven control over generation. The results are reported to generalize across model scales and families.

Abstract

Can large language models (LLMs) admit their mistakes when they should know better? In this work, we study when and why LLMs choose to retract, i.e., spontaneously and immediately acknowledge their errors. Using model-specific testbeds, we find that while LLMs are capable of retraction, they do so only rarely, even when they can recognize their mistakes when asked in a separate interaction. We identify a reliable predictor of retraction: the model's momentary belief, as measured by a probe on its internal states that is trained to predict correctness on external datasets unrelated to retraction. A model retracts only when it "believes" its answers to be incorrect during generation; these beliefs frequently diverge from models' parametric knowledge as measured by factoid questions. Steering experiments further demonstrate that model belief causally drives retraction. In particular, when the model believes its answer to be incorrect, this not only encourages the model to attempt further verification, but also alters attention dynamics. Finally, we show that supervised fine-tuning improves retraction performance by helping the model learn more accurate internal beliefs. Code and datasets are available at https://github.com/ayyyq/llm-retraction.
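
To make the probing and steering idea concrete, here is a minimal sketch under loud assumptions. It is not the authors' implementation. The activations are synthetic, a difference-of-means direction stands in for the trained probe, and every name (`fit_direction`, `belief_score`, `steer`, `synthetic_acts`) is hypothetical. It only illustrates two things: scoring a linear "belief" direction with AUROC, and shifting that score by adding a multiple of the direction to the activation.

```python
import random

def auroc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_direction(acts, labels):
    """Assumed stand-in probe: mean(correct) minus mean(incorrect) activation."""
    mu_pos = mean_vec([a for a, y in zip(acts, labels) if y == 1])
    mu_neg = mean_vec([a for a, y in zip(acts, labels) if y == 0])
    return [p - n for p, n in zip(mu_pos, mu_neg)]

def belief_score(act, direction):
    """Project an activation onto the belief direction."""
    return sum(a * d for a, d in zip(act, direction))

def steer(act, direction, alpha):
    """Add alpha * direction; negative alpha pushes toward 'incorrect'."""
    return [a + alpha * d for a, d in zip(act, direction)]

def synthetic_acts(n, dim, rng):
    """Hypothetical data: label 1 (correct) shifts the first coordinate up."""
    labels = [rng.randint(0, 1) for _ in range(n)]
    acts = [
        [rng.gauss(1.0 if y else -1.0, 1.0)]
        + [rng.gauss(0.0, 1.0) for _ in range(dim - 1)]
        for y in labels
    ]
    return acts, labels

rng = random.Random(0)
train_acts, train_labels = synthetic_acts(400, 8, rng)
test_acts, test_labels = synthetic_acts(200, 8, rng)

direction = fit_direction(train_acts, train_labels)
scores = [belief_score(a, direction) for a in test_acts]
test_auroc = auroc(scores, test_labels)

# Negative steering lowers every belief score by 2 * ||direction||^2.
steered = [belief_score(steer(a, direction, -2.0), direction) for a in test_acts]

print(f"held-out AUROC: {test_auroc:.3f}")
```

In a real model the activations would come from a chosen layer's hidden states and the steering vector would be added during the forward pass; both details are left out of this toy version.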

Paper Structure

This paper contains 49 sections, 5 equations, 9 figures, 21 tables.

Figures (9)

  • Figure 1: Markers distinguish correct answers, wrong answers, and retractions. We investigate when LLMs fail to retract, even when verification questions show that they know the answer is wrong.
  • Figure 2: Layer-wise AUROC of belief scores for predicting factual correctness and retraction in Llama3.1-8B. An AUROC of 0.5 corresponds to random guessing. Results are averaged over three runs with different random seeds, and error bars denote standard deviation.
  • Figure 3: Retraction rate under belief steering. "Belief-" denotes negative belief steering while "Belief+" denotes positive belief steering.
  • Figure 4: Layer-wise AUROC of belief scores in Llama3.1-8B (Base) and its fine-tuned variant (SFT).
  • Figure 5: Layer-wise AUROC of belief scores for factual correctness and retraction of Qwen2.5-7B and Olmo2-7B on the Wikidata test set.
  • ...and 4 more figures