Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention

Wenbin An; Feng Tian; Sicong Leng; Jiahao Nie; Haonan Lin; QianYing Wang; Ping Chen; Xiaoqin Zhang; Shijian Lu

TL;DR

This work traces object hallucination in large vision-language models (LVLMs) to an attention deficiency: the models attend mostly to prompt-irrelevant global image features rather than prompt-relevant local ones. It introduces Image-Prompt Matching (IPM), which produces an augmented view of the image that highlights prompt-relevant content, and Assembly of Global and Local Attention (AGLA), which fuses global generation with local discrimination at decoding time. The approach is training-free and plug-and-play. Across POPE, ROPE, MME, CHAIR, and LLaVA-Bench-Wild, AGLA yields consistent improvements on hallucination mitigation and perception tasks, supporting its applicability to both discriminative and generative multimodal tasks. The authors release their code and show that combining local prompt-relevant cues with global generative signals improves visual grounding and caption quality.

Abstract

Despite great success across various multimodal tasks, Large Vision-Language Models (LVLMs) often encounter object hallucinations with generated textual responses being inconsistent with the actual objects in images. We examine different LVLMs and pinpoint that one root cause of object hallucinations lies with deficient attention on discriminative image features. Specifically, LVLMs often predominantly attend to prompt-irrelevant global features instead of prompt-relevant local features, undermining their visual grounding capacity and leading to object hallucinations. We propose Assembly of Global and Local Attention (AGLA), a training-free and plug-and-play approach that mitigates hallucinations by assembling global features for response generation and local features for visual discrimination simultaneously. Specifically, we introduce an image-prompt matching scheme that captures prompt-relevant local features from images, leading to an augmented view of the input image where prompt-relevant content is highlighted while irrelevant distractions are suppressed. Hallucinations can thus be mitigated with a calibrated logit distribution that is from generative global features of the original image and discriminative local features of the augmented image. Extensive experiments show the superiority of AGLA in LVLM hallucination mitigation, demonstrating its wide applicability across both discriminative and generative tasks. Our code is available at https://github.com/Lackel/AGLA.
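The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `keep_ratio` and `alpha` parameters, the zero-fill masking, and the weighted-sum fusion rule are all assumptions made for illustration, and the per-patch relevance scores are taken as given rather than computed by a matching model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mask_irrelevant_patches(patches, relevance, keep_ratio=0.5):
    """Build an augmented view (hypothetical helper).

    Keeps the patches with the highest prompt-relevance scores and
    replaces the rest with 0.0. `keep_ratio` is an assumed parameter.
    """
    n_keep = max(1, int(len(patches) * keep_ratio))
    ranked = sorted(range(len(patches)), key=lambda i: relevance[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [p if i in keep else 0.0 for i, p in enumerate(patches)]


def fuse_logits(global_logits, local_logits, alpha=1.0):
    """Calibrate next-token probabilities (assumed fusion rule).

    Adds the logits obtained from the original image (global) to the
    alpha-weighted logits obtained from the augmented view (local),
    then normalizes with softmax.
    """
    fused = [g + alpha * l for g, l in zip(global_logits, local_logits)]
    return softmax(fused)


# Toy example with a two-token vocabulary: index 0 = "yes", index 1 = "no".
patches = [0.9, 0.4, 0.7, 0.1]          # stand-in patch values
relevance = [0.1, 0.8, 0.6, 0.05]       # stand-in prompt-relevance scores
augmented = mask_irrelevant_patches(patches, relevance, keep_ratio=0.5)

global_logits = [2.0, 1.5]              # original image favors "yes"
local_logits = [0.2, 1.8]               # augmented view favors "no"
probs = fuse_logits(global_logits, local_logits, alpha=1.0)
```

In this toy setting the original image alone favors "yes", while the fused distribution favors "no". With `alpha=0.0` the fusion reduces to ordinary decoding on the original image.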

Paper Structure (28 sections, 7 equations, 12 figures, 16 tables)

This paper contains 28 sections, 7 equations, 12 figures, and 16 tables.

Introduction
Related Work
Large Vision-Language Models
Object Hallucination
Method
Image-Prompt Matching
Assembly of Global and Local Attention
Experiments
Experimental Settings
Experimental Results
Ablation Study
Discussion
Conclusion
Acknowledgments
Limitation and Future Work
...and 13 more sections

Figures (12)

Figure 1: Efficacy of AGLA under different settings of the POPE dataset (Li et al., 2023). A lower F1 score means a higher hallucination rate.
Figure 2: Self-attention weights of the LVLM over image patch features when it answers different object queries (Yes or No). The first row shows the LVLM self-attention toward the original image, where the attention is dominated by certain global features and shows similar patterns regardless of the queried object. The second row shows the LVLM self-attention toward augmented views of the image, where the attention is more prompt-relevant and captures query-relevant local features.
Figure 3: An illustration of the proposed Image-Prompt Matching. $\mathrm{sim}(v, t)$ is an output score of the matching model, which measures the similarity between image $v$ and prompt $t$.
Figure 4: Performance with original or augmented input images.
Figure 5: Decoding with Assembly of Global and Local Attention.
...and 7 more figures
