Segmentation-Based Attention Entropy: Detecting and Mitigating Object Hallucinations in Large Vision-Language Models

Jiale Song, Jiaxin Luo, Xue-song Tang, Kuangrong Hao, Mingbo Zhao

Abstract

Large Vision-Language Models (LVLMs) achieve strong performance on many multimodal tasks, but object hallucinations severely undermine their reliability. Most existing studies focus on the text modality, attributing hallucinations to overly strong language priors and insufficient visual grounding. In contrast, we observe that abnormal attention patterns within the visual modality can also give rise to hallucinated objects. Building on this observation, we propose Segmentation-based Attention Entropy (SAE), which leverages semantic segmentation to quantify visual attention uncertainty in an object-level semantic space. Based on SAE, we further design a reliability score for hallucination detection and an SAE-guided attention adjustment method that modifies visual attention at inference time to mitigate hallucinations. We evaluate our approach on public benchmarks and in real embodied multimodal scenarios with quadruped robots. Experimental results show that SAE substantially reduces object hallucinations without any additional training cost, thereby enabling more trustworthy LVLM-driven perception and decision-making.
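To make the idea of measuring attention uncertainty "in an object-level semantic space" concrete, the following is a minimal sketch, not the authors' formulation. It assumes that the attention a generated object token pays to image tokens is first summed within each semantic segment, and that SAE is then the Shannon entropy of the resulting per-segment distribution. The function name `segment_attention_entropy` and the inputs `attn` and `seg_ids` are hypothetical; the paper's actual equations may differ (for example in normalization or in how heads and layers are aggregated).

```python
import numpy as np


def segment_attention_entropy(attn, seg_ids):
    """Illustrative segment-level attention entropy (assumed definition).

    attn    : non-negative attention weights over image tokens, any shape.
    seg_ids : integer segment label for each image token, same shape as attn.
    Returns the Shannon entropy (in nats) of attention mass pooled per segment.
    """
    attn = np.asarray(attn, dtype=np.float64).ravel()
    seg_ids = np.asarray(seg_ids).ravel()
    if attn.shape != seg_ids.shape:
        raise ValueError("attn and seg_ids must have the same number of tokens")
    if np.any(attn < 0):
        raise ValueError("attention weights must be non-negative")

    # Map arbitrary segment labels to 0..K-1 and pool attention per segment.
    _, inverse = np.unique(seg_ids, return_inverse=True)
    mass = np.bincount(inverse, weights=attn)

    total = mass.sum()
    if total <= 0:
        raise ValueError("attention weights sum to zero")
    p = mass / total

    # Entropy over segments; zero-probability segments contribute nothing.
    nz = p > 0
    return float(-(p[nz] * np.log(p[nz])).sum())


# Attention concentrated on one segment versus spread evenly over two segments.
seg = [0, 0, 1, 1]
focused = segment_attention_entropy([0.9, 0.1, 0.0, 0.0], seg)
diffuse = segment_attention_entropy([0.25, 0.25, 0.25, 0.25], seg)
```

Under this assumed definition, attention confined to a single segment gives zero entropy, while attention split evenly across two segments gives ln 2. That matches the qualitative pattern the paper reports: low SAE for real objects and high SAE for hallucinated ones.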

Paper Structure (12 sections, 7 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Object hallucinations in LVLMs and SAE. (a) Example of real (green) and hallucinated (red) objects in an LVLM description. (b) SAE pipeline: real objects correspond to low SAE, hallucinated objects to high SAE.
  • Figure 2: Head-level attention distributions for real objects (green) and hallucinated objects (red). Real objects have low SAE with attention concentrated on the corresponding image tokens, whereas hallucinated objects have high SAE with attention spread over many irrelevant image tokens.
  • Figure 3: Layer-head SAE statistics. The horizontal axis denotes the head index and the vertical axis denotes the layer index. The difference between real and hallucinated objects is most pronounced in the middle layers.
  • Figure 4: Qualitative example of SAE-guided attention intervention during inference. SAE-guided modulation reduces hallucinated objects (red) such as a spoon, a dining table, and a dog, while preserving or even completing real objects (green).
  • Figure 5: SAE-Guided Navigation Pipeline: The robot takes a first-person RGB-D image and a language instruction as inputs to the LVLM, with the output being an action path in the physical world.
  • ...and 2 more figures