
Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations

Tuan Dung Nguyen, Minh Khoi Ho, Qi Chen, Yutong Xie, Nguyen Cam-Tu, Minh Khoi Nguyen, Dang Huy Pham Nguyen, Anton van den Hengel, Johan W. Verjans, Phi Le Nguyen, Vu Minh Hieu Phan

Abstract

Large vision-language models (LVLMs) achieve strong performance on visual reasoning tasks but remain highly susceptible to hallucination. Existing detection methods predominantly rely on coarse, whole-image measures of how an object token relates to the input image. This global strategy is limited: hallucinated tokens may exhibit weak but widely scattered correlations across many local regions, which aggregate into deceptively high overall relevance and thus evade current global hallucination detectors. We begin with a simple yet critical observation: a faithful object token must be strongly grounded in a specific image region. Building on this insight, we introduce a patch-level hallucination detection framework that examines fine-grained token-level interactions across model layers. Our analysis uncovers two characteristic signatures of hallucinated tokens: (i) they yield diffuse, non-localized attention patterns, in contrast to the compact, well-focused attention seen in faithful tokens; and (ii) they fail to exhibit meaningful semantic alignment with any visual region. Guided by these findings, we develop a lightweight and interpretable detection method that combines patch-level statistical features with hidden-layer representations. Our approach achieves up to 90% accuracy in token-level hallucination detection, demonstrating the superiority of fine-grained structural analysis for detecting hallucinations.
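
As a concrete illustration of the detector the abstract describes, the following is a minimal sketch (ours, not the authors'): a logistic-regression classifier over per-token features that concatenate the two patch-level statistics introduced below (ADS and CGC) with the token's hidden-layer representation. The classifier choice, dimensions, and synthetic data are assumptions; the abstract states only that the detector is lightweight and interpretable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 200, 32             # toy token count and hidden-state width (assumptions)
y = rng.integers(0, 2, n)  # 1 = hallucinated, 0 = faithful (synthetic labels)

# Synthetic stand-in features: per the paper's findings, hallucinated tokens
# tend to have higher ADS (diffuse attention) and lower CGC (weak grounding).
ads = rng.normal(2.0 + y, 0.5)           # attention dispersion score
cgc = rng.normal(0.8 - 0.4 * y, 0.1)     # cross-modal grounding consistency
hidden = rng.normal(size=(n, d))         # token hidden-layer representation
X = np.column_stack([ads, cgc, hidden])  # (n, 2 + d) feature matrix

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(f"training accuracy: {clf.score(X, y):.2f}")
```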

Paper Structure

This paper contains 22 sections, 10 equations, 11 figures, and 8 tables.

Figures (11)

  • Figure 1: Left: SVAR (jiang2025devils) captures image-level statistics by globally summing attention across the image, ignoring local attention structure; this triggers false alarms when locally noisy attention appears. Right: Our proposed “attention dispersion score” encodes the token-level attention distribution, quantifying how much attention the model allocates to multiple local regions simultaneously. Spreading focus across many regions, i.e., high entropy, indicates hallucination. Concretely, our pipeline first suppresses noisy regions and detects focused objects by grouping, and then quantifies the entropy of the token-level attention distribution. Low entropy means the model focuses “sharply” on particular objects and is therefore likely to predict true objects; conversely, high entropy means the model is unfocused when predicting, indicating hallucination.
  • Figure 2: Overview of our token-level hallucination detection framework. We reveal two key indicators of hallucination: (1) Attention Dispersion Score (ADS): hallucinated tokens exhibit highly diffuse, non-localized attention across local image patches, while faithful tokens show concentrated focus; and (2) Cross-modal Grounding Consistency (CGC): hallucinated text tokens exhibit low alignment with any object region, as shown in the right panel (see the CGC sketch after this list). Our proposed metrics encode the local structure of LVLM behavior, enabling lightweight, explainable hallucination detection that is robust to attention-sink scenarios.
  • Figure 3: Illustration of the Attention Dispersion Score (ADS) computation (see the code sketch after this list). After predicting an object token, we extract the text-to-patch cross-modal attention map. The top $k\%$ of activations are kept to isolate highly focused regions. We then form the attended object regions and suppress attention sinks via $N$-connected-component grouping. Finally, the proposed ADS is computed as the entropy of the resulting token-level attention distribution. A peaky distribution indicates sharp object focus and hence a real object token prediction; in contrast, a near-uniform patch-wise distribution implies that the model scatters its attention when predicting, indicating hallucination.
  • Figure 4: Visual attention maps for a true token ("camera", top) vs. a hallucinated token ("wristwatch", bottom) across layers (10, 17, 22). True tokens show a focused attention pattern aligned with the object’s location, while hallucinated tokens scatter attention across the image.
  • Figure 5: Layer-wise attention entropy of true vs. hallucinated tokens across LVLMs (lower is more focused). Reported $p$-values indicate strong separation in early/mid layers (see the significance-test sketch after this list).
  • ...and 6 more figures
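
The ADS pipeline in Figures 1 and 3 (top-$k\%$ thresholding, connected-component grouping to suppress attention sinks, then entropy of the surviving distribution) is simple enough to sketch directly. The following is a minimal interpretation, not the authors' code; the `top_k` and `min_region` parameters, 4-connectivity, and Shannon entropy are our assumptions.

```python
import numpy as np
from scipy import ndimage

def attention_dispersion_score(attn_map, top_k=0.10, min_region=2):
    """Minimal ADS sketch following Figure 3 (parameter names are ours).

    attn_map   : (H, W) text-to-patch cross-attention for one generated token.
    top_k      : fraction of highest-activation patches to keep.
    min_region : connected components smaller than this are treated as
                 attention sinks / noise and suppressed.
    """
    attn = np.asarray(attn_map, dtype=np.float64)
    # 1) Keep only the top-k% of activations to isolate focused regions.
    mask = attn >= np.quantile(attn, 1.0 - top_k)
    # 2) Group kept patches into connected components (4-connectivity by
    #    default) and drop tiny blobs, i.e., isolated attention sinks.
    labels, n = ndimage.label(mask)
    for i in range(1, n + 1):
        component = labels == i
        if component.sum() < min_region:
            mask &= ~component
    # 3) Renormalize the surviving attention mass and take its entropy.
    kept = attn[mask]
    if kept.size == 0:
        return float("nan")  # nothing survives filtering; score undefined
    p = kept / kept.sum()
    return float(-(p * np.log(p + 1e-12)).sum())
```

A single compact surviving region carrying most of the attention mass yields low entropy; mass spread across several regions yields high entropy, the hallucination signature of Figure 1.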
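
Figure 2's second signal, Cross-modal Grounding Consistency, measures whether a generated token aligns semantically with any visual region. The exact formula is not given in this excerpt, so the following is a hypothetical reading: cosine similarity between the token's hidden state and pooled patch features for each candidate region, keeping the best region's score.

```python
import torch
import torch.nn.functional as F

def cross_modal_grounding_consistency(token_hidden, patch_hidden, region_masks):
    """Hypothetical CGC sketch (the exact formulation is our assumption).

    token_hidden : (d,) hidden state of the generated object token.
    patch_hidden : (P, d) hidden states of the P visual patch tokens,
                   taken from the same layer.
    region_masks : (R, P) boolean masks, one per candidate object region.
    Returns the best cosine alignment between the token and any region;
    a low maximum signals a token grounded in no region (hallucination).
    """
    token = F.normalize(token_hidden, dim=-1)
    patches = F.normalize(patch_hidden, dim=-1)
    scores = []
    for mask in region_masks:
        region = F.normalize(patches[mask].mean(dim=0), dim=-1)  # pooled region
        scores.append(torch.dot(token, region))
    return torch.stack(scores).max().item()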
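
Figure 5 reports layer-wise separation via $p$-values, but the caption does not name the test; a Welch's $t$-test on per-token entropies (e.g., computed with `attention_dispersion_score` above) is one plausible stand-in, shown here on illustrative synthetic values.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Illustrative per-token entropies at one layer; real values would come from
# attention_dispersion_score applied to that layer's attention maps.
true_ent = rng.normal(1.2, 0.3, 100)    # faithful tokens: focused, low entropy
halluc_ent = rng.normal(2.1, 0.4, 100)  # hallucinated tokens: diffuse, high
t_stat, p_val = ttest_ind(true_ent, halluc_ent, equal_var=False)  # Welch's test
print(f"t = {t_stat:.2f}, p = {p_val:.1e}")
```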