MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding

Jingyuan Deng, Yujiu Yang

TL;DR

This work addresses hallucinations in large vision-language models (LVLMs) with MaskCD, a training-free method that masks the "image heads" in the LLM backbone to construct refined negative samples for contrastive decoding. MaskCD identifies image heads from attention statistics and applies the resulting mask during decoding via $p(y_t)=\text{softmax}\Bigl((1+\alpha)\cdot \text{logits}_\theta(y_t|I,Q,y_{<t}) - \alpha \cdot \text{logits}_{\theta_m}(y_t|I,Q,y_{<t})\Bigr)$, where $\theta_m$ denotes the model with its image heads masked. On LLaVA-1.5-7b and Qwen-VL-7b, MaskCD achieves superior or competitive hallucination mitigation on the CHAIR, POPE, AMBER, and MME benchmarks, while preserving general multimodal performance and costing less computation than comparable methods such as OPERA. Two limitations remain: image inputs are needed to compute the mask, and a mask is specific to the backbone it was computed for. Both suggest dynamic, on-the-fly masking as future work.
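The decoding formula in the TL;DR can be illustrated with a minimal sketch in plain Python. It assumes the two logit vectors (from the original model and from the model with image heads masked) have already been computed for the current step; the function names `softmax` and `maskcd_next_token_probs` and the default `alpha` are illustrative choices, not taken from the paper or its repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def maskcd_next_token_probs(logits_full, logits_masked, alpha=1.0):
    """Combine logits as softmax((1 + alpha) * full - alpha * masked).

    logits_full:   next-token logits from the original model (theta).
    logits_masked: next-token logits from the image-head-masked model (theta_m).
    alpha:         contrast strength; alpha = 0 recovers ordinary decoding.
    The default alpha here is an assumption for illustration only.
    """
    if len(logits_full) != len(logits_masked):
        raise ValueError("both logit vectors must cover the same vocabulary")
    contrasted = [
        (1 + alpha) * lf - alpha * lm
        for lf, lm in zip(logits_full, logits_masked)
    ]
    return softmax(contrasted)


if __name__ == "__main__":
    # Toy 3-token vocabulary: token 1 scores higher once image heads are
    # masked, so the contrast pushes its probability down.
    full = [2.0, 1.0, 0.0]
    masked = [2.0, 2.0, 0.0]
    print(maskcd_next_token_probs(full, masked, alpha=1.0))
```

In this toy case the token whose logit rises when the image heads are masked is penalized, which is the intended effect of contrasting against a negative sample.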

Abstract

Large vision-language models (LVLMs) have shown remarkable performance in visual-language understanding for downstream multimodal tasks. As their capabilities improve, new problems emerge. Among these, hallucination, the phenomenon in which LVLMs generate content that contradicts their visual and textual inputs, has attracted much attention. Many approaches have been proposed to deal with this issue, such as contrastive decoding and attention manipulation. However, contrastive decoding methods struggle to construct appropriate contrastive samples, and attention manipulation methods are highly sensitive and lack stability. In this work, we propose image head Masked Contrastive Decoding (MaskCD). Our approach utilizes the "image heads" in LVLMs, masking them to construct contrastive samples for contrastive decoding. We evaluated MaskCD on LLaVA-1.5-7b and Qwen-VL-7b, using various benchmarks such as CHAIR, POPE, AMBER and MME. The results demonstrate that MaskCD effectively alleviates hallucinations while retaining the general capabilities of LVLMs. Corresponding resources can be found at: https://github.com/Deng-Jingyuan/MaskCD .
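The idea of selecting "image heads" from attention statistics might look roughly like the sketch below. The summary does not state the exact selection criterion, so everything here is an assumption for illustration: the nested-list layout `attn[layer][head]`, the use of one query token's attention row, the score (total attention mass on image-token positions), and the `threshold` value. The helper names are hypothetical as well.

```python
def image_attention_share(attn_row, image_positions):
    """Total attention weight that one head places on image-token positions."""
    return sum(attn_row[i] for i in image_positions)


def build_image_head_mask(attn, image_positions, threshold=0.5):
    """Flag heads whose attention mass on image tokens reaches a threshold.

    attn[layer][head] is a list of attention weights over key positions for
    a single query token (each row is assumed to sum to 1).
    Returns mask[layer][head], True where the head is treated as an image head.
    The thresholding rule is an assumption, not the paper's stated criterion.
    """
    return [
        [image_attention_share(row, image_positions) >= threshold for row in layer]
        for layer in attn
    ]


if __name__ == "__main__":
    # One layer, two heads, three key positions; position 1 is an image token.
    attn = [[[0.1, 0.8, 0.1], [0.6, 0.2, 0.2]]]
    print(build_image_head_mask(attn, image_positions=[1], threshold=0.5))
```

A mask of this kind would then determine which heads are suppressed when producing the contrastive sample. In practice the statistics would presumably be aggregated over many images and prompts rather than taken from a single attention row.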

Paper Structure

This paper contains 32 sections, 8 equations, 3 figures, 11 tables.

Figures (3)

  • Figure 1: Pipeline of MaskCD. The upper part shows the first step: the image head mask is constructed by querying the LVLM with images and prompt texts. The lower part shows how the image head mask is used during contrastive decoding.
  • Figure 2: Visualization of image heads in LLaVA-1.5-7b. The left panel shows the image head distribution for real-world images, and the right panel shows the results for artificial images generated by DALL-E. Certain heads evidently tend to pay high attention to image tokens; we therefore call them "image heads".
  • Figure 3: Visualization of MME scores for LLaVA-1.5-7b (left) and Qwen-VL-7b (right). Scores are normalized by dividing by the maximum score of each subset.