Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models

Bin Li, Dehong Gao, Yeyuan Wang, Linbo Jin, Shanqing Yu, Xiaoyan Cai, Libin Yang

TL;DR

This work addresses hallucinations in large vision-language models (LVLMs) by identifying image tokens that receive excessive attention but are irrelevant to the query. It introduces Instruction-Aligned Visual Attention (IAVA), which compares attention weights under two different instructions to select the irrelevant tokens; these tokens form the negative sample in a contrastive decoding step that downweights hallucinated outputs. The approach is training-free and is evaluated on two LVLMs (LLaVA-v1.5 and InstructBLIP) across MME, POPE, and TextVQA, where it outperforms existing contrastive decoding methods. The results indicate improved visual grounding and more reliable LVLM outputs. Code is released online.

Abstract

Despite the significant success of Large Vision-Language Models (LVLMs), these models still suffer from hallucinations when describing images, generating answers that include non-existent objects. It is reported that these models tend to over-focus on certain irrelevant image tokens that carry no information critical to answering the question, which distorts the output. To address this, we propose an Instruction-Aligned Visual Attention (IAVA) approach, which identifies irrelevant tokens by comparing changes in attention weights under two different instructions. By applying contrastive decoding, we dynamically adjust the logits generated from the original image tokens and from the irrelevant image tokens, reducing the model's over-attention to irrelevant information. Experimental results demonstrate that IAVA consistently outperforms existing decoding techniques in mitigating object hallucinations on benchmarks such as MME, POPE, and TextVQA. Our IAVA approach is available online at https://github.com/Lee-lab558/IAVA.
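The contrastive decoding step mentioned in the abstract can be illustrated with a minimal sketch. The exact IAVA formulation is not given in this section, so the code below assumes the common contrastive-decoding form used by earlier methods such as VCD, (1 + alpha) * original - alpha * negative, with a fixed weight. The names `contrastive_logits`, `alpha`, and the toy vocabulary and logit values are hypothetical, and the paper's "dynamic" adjustment is not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(logits_original, logits_negative, alpha=1.0):
    """Combine two logit vectors in the common contrastive-decoding form.

    ASSUMPTION: a fixed alpha and the (1 + alpha) * orig - alpha * neg rule;
    this is an illustrative stand-in, not the paper's exact equation.
    """
    if len(logits_original) != len(logits_negative):
        raise ValueError("logit vectors must have the same length")
    return [
        (1 + alpha) * o - alpha * n
        for o, n in zip(logits_original, logits_negative)
    ]


# Toy next-token distribution over a hypothetical three-word vocabulary.
vocab = ["dog", "cat", "frisbee"]
logits_original = [2.0, 1.0, 1.8]  # conditioned on the full image
logits_negative = [1.0, 1.0, 2.5]  # conditioned on irrelevant tokens only

adjusted = contrastive_logits(logits_original, logits_negative, alpha=1.0)
probs = softmax(adjusted)
best = vocab[probs.index(max(probs))]
print(best)
```

In this toy case the negative branch strongly favors "frisbee", so the combination penalizes that token and the token supported mainly by the full image wins. A larger `alpha` penalizes tokens favored by the negative branch more strongly.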

Paper Structure

This paper contains 17 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Selection of irrelevant image tokens in our IAVA approach
  • Figure 2: Overview of the IAVA framework. (a) Identify which image tokens receive excessive attention but are irrelevant to the query by comparing the model’s attention scores under two different instructions. (b) Use the irrelevant image tokens obtained in (a) and perform contrastive decoding with the original image to mitigate the influence of these tokens.
  • Figure 3: Result comparison on MME. The IAVA approach achieved the highest overall score on both models, with the best performance on 10 sub-tasks when using LLaVA, and the best performance on 8 sub-tasks when using InstructBLIP.
  • Figure 4: Result Variation on MME with Parameter $i$
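
Step (a) in the Figure 2 caption, comparing attention scores under two instructions, might be sketched as follows. The selection rule here is a hypothetical one chosen for illustration: an image token counts as irrelevant if its attention does not increase when the query instruction replaces a general instruction, and the `top_k` most-attended such tokens are returned. The function name, the `top_k` parameter, and the toy attention values are all assumptions; the paper's actual criterion is not stated in this section.

```python
def select_irrelevant_tokens(attn_query, attn_general, top_k=2):
    """Return sorted indices of image tokens treated as irrelevant.

    attn_query:   attention per image token under the query instruction
    attn_general: attention per image token under a general instruction

    ASSUMED RULE (illustrative only): keep tokens whose attention does not
    rise under the query instruction, then take the top_k with the highest
    attention under the query instruction.
    """
    if len(attn_query) != len(attn_general):
        raise ValueError("attention vectors must have the same length")
    candidates = [
        i
        for i, (q, g) in enumerate(zip(attn_query, attn_general))
        if q - g <= 0
    ]
    # Most-attended candidates first: these receive "excessive" attention.
    candidates.sort(key=lambda i: attn_query[i], reverse=True)
    return sorted(candidates[:top_k])


# Toy attention over four image tokens (hypothetical values).
attn_query = [0.30, 0.05, 0.25, 0.40]
attn_general = [0.30, 0.10, 0.10, 0.50]

irrelevant = select_irrelevant_tokens(attn_query, attn_general, top_k=2)
print(irrelevant)
```

Token 2 gains attention under the query instruction and is therefore kept out of the selection, while tokens 0 and 3 stay highly attended regardless of the instruction and are selected.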