
DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination

Xuan Gong, Tianshi Ming, Xinpeng Wang, Zhihua Wei

TL;DR

DAMRO investigates the link between ViT visual encoder attention and LLM decoder attention, revealing that high-attention outlier tokens correlate with object hallucination. It then provides a training-free solution that (1) selects top $k$ outlier tokens via the [CLS] attention and (2) applies adaptive contrastive decoding to downweight these tokens during generation. Across LLaVA-1.5, LLaVA-NeXT, and InstructBLIP, and benchmarks POPE, CHAIR, MME, and GPT-4V-aided evaluation, DAMRO consistently reduces hallucinations and improves factual alignment without external data. The approach shows strong generalizability due to its reliance on attention patterns rather than architectural changes, offering a practical path to safer LVLMs. Limitations include lack of formal theory and potential model-specific dynamics, suggesting future work on deeper theoretical grounding and broader backbone compatibility.
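Step (1) above can be illustrated with a minimal sketch. It assumes the [CLS]-to-patch attention has already been extracted from the ViT and reduced to one score per patch token; which layer to read and how to combine heads are not specified in this summary. The function name `select_outlier_tokens` and the toy values are hypothetical and are not taken from the released code.

```python
from typing import List


def select_outlier_tokens(cls_attention: List[float], k: int) -> List[int]:
    """Return the indices of the k patch tokens that receive the most [CLS] attention.

    cls_attention: one score per patch token (assumed to be pre-extracted
    from the visual encoder; the extraction itself is not shown here).
    """
    if not 0 <= k <= len(cls_attention):
        raise ValueError("k must be between 0 and the number of patch tokens")
    # Rank patch indices by descending attention; sorted() is stable,
    # so ties keep their original (lower-index-first) order.
    ranked = sorted(range(len(cls_attention)), key=lambda i: -cls_attention[i])
    # Return the selected indices in positional order for readability.
    return sorted(ranked[:k])


# Toy example: a 2x3 grid of patches, flattened row by row.
toy_attention = [0.02, 0.40, 0.03, 0.05, 0.45, 0.05]
outliers = select_outlier_tokens(toy_attention, k=2)
print(outliers)
```

In this toy input, patches 1 and 4 absorb most of the [CLS] attention, so they are the ones flagged as outliers.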

Abstract

Despite the great success of Large Vision-Language Models (LVLMs), they inevitably suffer from hallucination. Both the visual encoder and the Large Language Model (LLM) decoder in LVLMs are Transformer-based, so the model extracts visual information and generates text output via attention mechanisms. We find that the attention distribution of the LLM decoder over image tokens is highly consistent with that of the visual encoder, and that both distributions tend to focus on particular background tokens rather than on the referred objects in the image. We attribute this unexpected attention distribution to an inherent flaw in the visual encoder itself, which misguides the LLM to overemphasize redundant information and generate object hallucinations. To address the issue, we propose DAMRO, a novel training-free strategy that $D$ives into the $A$ttention $M$echanism of LVLMs to $R$educe $O$bject Hallucination. Specifically, our approach employs the classification token (CLS) of the ViT to filter out high-attention outlier tokens scattered in the background and then eliminates their influence during the decoding stage. We evaluate our method on LVLMs including LLaVA-1.5, LLaVA-NeXT and InstructBLIP, using benchmarks such as POPE, CHAIR, MME and GPT-4V Aided Evaluation. The results demonstrate that our approach significantly reduces the impact of these outlier tokens, thus effectively alleviating hallucination in LVLMs. The code is released at https://github.com/coder-gx/DAMRO.
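The decoding-stage idea can be sketched with the generic contrastive-decoding form, in which next-token logits from the full input are contrasted against logits from a second pass that is assumed here to see only the outlier tokens. This summary does not give the paper's exact formula, so the weighting `(1 + alpha) * full - alpha * outlier`, the parameter `alpha`, the function names, and the toy logits are all illustrative assumptions and not the authors' implementation.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(
    full_logits: List[float], outlier_logits: List[float], alpha: float = 1.0
) -> List[float]:
    """Amplify the full-input logits and subtract the outlier-only logits.

    alpha is a hypothetical contrast strength; alpha = 0 recovers the
    ordinary (uncontrasted) logits.
    """
    if len(full_logits) != len(outlier_logits):
        raise ValueError("both logit vectors must cover the same vocabulary")
    return [(1 + alpha) * f - alpha * o for f, o in zip(full_logits, outlier_logits)]


# Toy three-word vocabulary with made-up logits.
vocab = ["plant", "clock", "table"]
full = [2.0, 1.8, 0.5]      # conditioned on all image tokens
outlier = [0.5, 2.0, 0.4]   # conditioned on the outlier tokens only (assumed)

before = softmax(full)
after = softmax(contrastive_logits(full, outlier, alpha=1.0))
for word, p0, p1 in zip(vocab, before, after):
    print(f"{word}: {p0:.3f} -> {p1:.3f}")
```

In the toy numbers, "clock" is the word that the outlier-only pass favors, so its probability falls after contrasting while "plant" gains probability mass.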

Paper Structure

This paper contains 31 sections, 8 equations, 16 figures, 12 tables, 1 algorithm.

Figures (16)

  • Figure 1: An overview of DAMRO. We utilize attention mechanism to filter the outlier tokens, and then apply contrastive decoding to mitigate the influence of outlier tokens in LLM decoding stage.
  • Figure 2: Attention map of visual encoder. Left: original image. Middle: attention map of InstructBLIP ViT (16x16). Right: attention map of LLaVA-1.5 ViT (24x24).
  • Figure 3: LLM decoder attention map of the "plant" token (non-hallucinatory). The attention accurately locates the position of the potted plant.
  • Figure 4: LLM decoder attention map of the "clock" token (hallucinatory). The attention mainly focuses on the outlier tokens in the background, whose positions match those in the visual encoder attention map in the right sub-image of Figure 2.
  • Figure 5: The proportion of the overall attention map in LLM decoder.
  • ...and 11 more figures