Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models

Abin Shoby, Ta Duc Huy, Tuan Dung Nguyen, Minh Khoi Ho, Qi Chen, Anton van den Hengel, Phi Le Nguyen, Johan W. Verjans, Vu Minh Hieu Phan

TL;DR

By probing decoder layers, the authors introduce the Overthinking Score, a metric that measures how many competing hypotheses the model entertains and how unstable those hypotheses are across layers. Used for hallucination detection, the score reaches 78.9% F1 on MSCOCO and 71.58% on AMBER.

Abstract

Vision Language Models (VLMs) often hallucinate non-existent objects. Detecting hallucination is analogous to detecting deception: a single final statement is insufficient; one must examine the underlying reasoning process. Yet existing detectors rely mostly on final-layer signals. Attention-based methods assume hallucinated tokens exhibit low attention, while entropy-based ones use final-step uncertainty. Our analysis reveals the opposite: hallucinated objects can exhibit peaked attention due to contextual priors, and models often express high confidence because intermediate layers have already converged to an incorrect hypothesis. We show that the key to hallucination detection lies within the model's thought process, not its final output. By probing decoder layers, we uncover a previously overlooked behavior, overthinking: models repeatedly revise object hypotheses across layers before committing to an incorrect answer. Once the model latches onto a confounded hypothesis, that hypothesis can propagate through subsequent layers, ultimately causing hallucination. To capture this behavior, we introduce the Overthinking Score, a metric that measures how many competing hypotheses the model entertains and how unstable these hypotheses are across layers. This score significantly improves hallucination detection: 78.9% F1 on MSCOCO and 71.58% on AMBER.

Paper Structure

This paper contains 26 sections, 7 equations, 17 figures, 10 tables, 2 algorithms.

Figures (17)

  • Figure 1: Overthinking leads to object hallucination in Vision-Language Models. Each column corresponds to the top predicted object token at a decoder layer, illustrating the model’s layer-wise reasoning progression. We propose the Overthinking Score (S-OT) to measure how much the model shifts among objects across layers. Top: the model demonstrates stable reasoning, quickly converging on a consistent concept (cat) across decoder layers and yielding a low S-OT. Bottom: the model shows overthinking, hesitating between semantically co-occurring objects or "confounders" (sink, soap) that bias it towards confidently producing a hallucinated answer (dish), which is captured by a high S-OT.
  • Figure 2: Left: Confounder Propagation examples. In each example, the image and its prefix prompt are shown on top. Under the image, we highlight the confounders that appear in the intermediate layers before the hallucinated final-layer token. We also list the values of different hallucination indicators: SVAR, MetaToken, Final-Layer Entropy, and our proposed Overthinking Score (S-OT). The attention map of the fifth layer is also presented, following SVAR. Existing methods ignore intermediate-layer token dynamics and therefore miss hallucinations. In contrast, our approach detects them by tracing confounder propagation. Right: Histograms of SVAR, MetaToken, Final-Layer Entropy, and our Overthinking Score, with the example positions highlighted. Existing indicators show substantial overlap between the Real and Hallucinated object distributions, while our Overthinking Score offers noticeably better separation.
  • Figure 3: Distribution of final-layer entropy in LLaVA-1.5, Gemma-3, and Qwen3-VL for hallucinated and real tokens. The strong overlap shows the weakness of entropy as a hallucination predictor.
  • Figure 4: Left: Correlation between per-layer average entropy and hallucination rate. Most layers exhibit positive correlation, suggesting that increased uncertainty across depth contributes to hallucination. Right: Confounder propagation rate versus the number of unique tokens in the intermediate layers. A higher number of unique tokens is associated with an increased likelihood of confounder propagation. Both are measured on LLaVA-1.5.
  • Figure 5: Hallucination detection pipeline. 1) We begin with Prefix Prompting, where the model is asked to predict the next token given an image and a partial prompt. 2) We apply LogitLens to extract the top-$p$ tokens at each decoder layer, revealing how the model’s intermediate hypotheses evolve. 3) In the Feature Extraction step, the top-$p$ token distributions are used to compute the Overthinking Score and layer-wise entropy, while image and text attention are measured with respect to the generated token. All features are concatenated into a single feature vector. 4) A Hallucination Detector is trained on these features to identify hallucinated cases.
  • ...and 12 more figures
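
The feature-extraction idea in the Figure 5 caption (counting competing hypotheses and their instability across layers, plus layer-wise entropy) can be illustrated with a toy sketch. This is not the paper's Overthinking Score: the exact S-OT formula is not given in this section, so `toy_overthinking_proxy`, its equal weighting, and the helper names below are hypothetical. The token sequences are made-up inputs that reuse the object names from the Figure 1 caption.

```python
import math
from typing import Sequence


def count_unique_hypotheses(top_tokens: Sequence[str]) -> int:
    """Number of distinct top-1 tokens seen across decoder layers."""
    return len(set(top_tokens))


def count_switches(top_tokens: Sequence[str]) -> int:
    """Number of adjacent layer pairs whose top-1 token differs."""
    return sum(1 for a, b in zip(top_tokens, top_tokens[1:]) if a != b)


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one layer's token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def toy_overthinking_proxy(top_tokens: Sequence[str]) -> float:
    """Hypothetical proxy in [0, 1]; NOT the paper's S-OT definition.

    Averages (a) the fraction of extra hypotheses and (b) the fraction
    of layer transitions where the top-1 token changes.
    """
    n = len(top_tokens)
    if n < 2:
        return 0.0
    unique_frac = (count_unique_hypotheses(top_tokens) - 1) / (n - 1)
    switch_frac = count_switches(top_tokens) / (n - 1)
    return 0.5 * (unique_frac + switch_frac)


# Made-up per-layer top-1 tokens, loosely mirroring the Figure 1 caption.
stable = ["cat", "cat", "cat", "cat", "cat", "cat"]
unstable = ["sink", "soap", "sink", "soap", "dish", "dish"]

print(toy_overthinking_proxy(stable))
print(toy_overthinking_proxy(unstable))
```

Under these assumptions the stable sequence scores zero and the unstable one scores higher, which matches the qualitative contrast the caption describes. In the pipeline described above, such per-token features would be concatenated with entropy and attention features before a detector is trained on them.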