Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding

Jiaming Li, Jiacheng Zhang, Zequn Jie, Lin Ma, Guanbin Li

TL;DR

This work tackles hallucinations in large vision-language models by addressing both uni-modal overreliance and spurious inter-modality correlations. It introduces IMCCD, a training-free framework composed of CMVED, which selectively distorts cross-modal value vectors during decoding, and CDAR, which refines cross-modal attention using content-driven, position-normalized cues. Across POPE, CHAIR, and MME benchmarks, IMCCD consistently reduces hallucinations and improves descriptive fidelity, with faster inference than prior contrastive methods. The approach demonstrates strong generalization to different LVLMs and captioning tasks, and is accompanied by open-source code. The work advances reliable multi-modal generation by aligning cross-modal interactions with actual visual content.

Abstract

Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. Despite this success, LVLMs still hallucinate in complex generation tasks, producing content that is inconsistent with the visual input. To address this issue, some approaches introduce inference-time interventions, such as contrastive decoding and attention rectification, to reduce overreliance on language priors. However, these approaches overlook hallucinations that stem from spurious inter-modality correlations. In this paper, we propose Inter-Modality Correlation Calibration Decoding (IMCCD), a training-free method for mitigating hallucinations in LVLMs. We design a Cross-Modal Value-Enhanced Decoding (CMVED) module that alleviates hallucination through a novel contrastive decoding mechanism. When estimating the distorted distribution, CMVED masks the value vectors associated with significant cross-modal attention weights, which addresses both uni-modality overreliance and misleading inter-modality correlations. Additionally, a Content-Driven Attention Refinement (CDAR) module refines cross-modal attention weights, guiding LVLMs to focus on important visual content. Experimental results on diverse hallucination benchmarks validate the superiority of our method over existing state-of-the-art techniques in reducing hallucinations in LVLM text generation. Our code will be available at https://github.com/lijm48/IMCCD.
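
The CMVED idea described in the abstract can be illustrated with a toy, single-head NumPy sketch. It is not the authors' implementation. The function names (`attend`, `contrastive_logits`), the top-k masking criterion, and the `alpha` weighting are all assumptions made for illustration. In particular, the `(1 + alpha) * original - alpha * distorted` combination is a form commonly used in contrastive decoding, and the abstract does not confirm that IMCCD uses exactly this formula.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V, image_idx, mask_top_k=0):
    """Single-query attention over all tokens.

    If mask_top_k > 0, the value vectors of the mask_top_k image tokens
    with the largest attention weights are zeroed (hypothetical stand-in
    for the 'distorted' branch). Attention weights are left unchanged.
    """
    weights = softmax(q @ K.T / np.sqrt(q.shape[-1]))
    V_used = V.copy()
    if mask_top_k > 0:
        # Rank image tokens by attention weight, highest first.
        order = np.argsort(weights[image_idx])[::-1][:mask_top_k]
        V_used[image_idx[order]] = 0.0
    return weights @ V_used, weights

def contrastive_logits(logits_orig, logits_distorted, alpha=1.0):
    # Assumed contrastive form: boost what the original branch prefers
    # relative to the distorted branch.
    return (1.0 + alpha) * logits_orig - alpha * logits_distorted

# Toy usage: 6 tokens of dimension 4, the first 4 treated as image tokens.
rng = np.random.default_rng(0)
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 4))
q = rng.normal(size=4)
image_idx = np.arange(4)

out_orig, w = attend(q, K, V, image_idx)
out_dist, _ = attend(q, K, V, image_idx, mask_top_k=2)

# A hypothetical output projection maps hidden states to vocabulary logits.
W_out = rng.normal(size=(4, 10))
final_logits = contrastive_logits(out_orig @ W_out, out_dist @ W_out, alpha=1.0)
next_token = int(np.argmax(final_logits))
```

Masking values rather than perturbing the whole image keeps the attention pattern intact in the distorted branch, so the contrast isolates what the most-attended image tokens contribute. In a real LVLM this would apply per head and per layer, which the sketch omits.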

Paper Structure (20 sections, 11 equations, 8 figures, 12 tables)

Figures (8)

  • Figure 1: An example to illustrate the spurious inter-modality correlation. The figure shows significant inter-modality attention between the text about the dining table and the food in the visual content, which leads to the hallucination of the object's existence. Existing decoding methods overlook the inter-modality correlation by distorting the image content, while our method preserves the inter-modality correlations by a selective mechanism.
  • Figure 2: An overview of the proposed IMCCD approach, consisting of two modules: Cross-Modal Value-Enhanced Decoding (CMVED) and Content-Driven Attention Refinement (CDAR). During inference, CMVED generates a distorted output distribution that favors hallucination by selectively masking value vectors corresponding to high attention weights in the cross-modal segment of the attention matrix. CMVED then performs contrastive decoding between the original outputs and distorted outputs to mitigate the hallucination. Additionally, CDAR refines the cross-modal segments of attention logits with content-driven attention logits estimated by normalizing the position indices of all image tokens to a uniform value.
  • Figure 3: An example illustrating over-reliance on the latter part of the image tokens. The LVLM's text tokens pay more attention to the nearest image tokens than to other image tokens, leading to a hallucination about the existence of a TV.
  • Figure 4: Results of LLaVA 1.5 on MME-Fullset.
  • Figure 5: The comparison of the hallucination rate of LLaVA 1.5 on the POPE dataset. 'TNR' and 'FNR' denote the true negative rate and the false negative rate of VQA, respectively. (a) The hallucination rate of the existence of objects with and without the co-existence with their top co-occurring object in the image. (b) The hallucination rate of the object's existence for different decoding methods with their top co-occurring object. Concretely, we estimate the mean hallucination rate on 5 pairs of objects with a high object existence rate.
  • ...and 3 more figures