When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance

Jinjin Cao, Zhiyang Chen, Zijun Wang, Liyuan Ma, Weijian Luo, Guojun Qi

TL;DR

This work tackles language bias-driven hallucinations in Vision-Language Models by introducing Cross-Modal Guidance (CMG), a training-free decoding technique that perturbs visual-language attention during inference. CMG constructs an Amateur Model by applying selective attention masks and contrasts its outputs with those of the original model, which strengthens reliance on visual context and reduces hallucinations without additional training. Across the POPE, HallusionBench, and MME benchmarks, CMG consistently improves accuracy and performance on perception tasks, and the gains hold across model sizes. The approach offers practical, scalable mitigation of hallucinations, though it requires careful hyperparameter tuning and dynamic layer/attention masking to avoid unintentionally amplifying biases.

Abstract

Vision-Language Models (VLMs) have shown solid ability for multimodal understanding of both visual and language contexts. However, existing VLMs often suffer from severe hallucinations: they tend to generate responses that are fluent as language but irrelevant to the images in the preceding context. To address this issue, we analyze how language bias contributes to hallucinations and then introduce Cross-Modal Guidance (CMG), a training-free decoding method that mitigates hallucinations by leveraging the difference between the output distribution of the original model and that of a model with degraded visual-language attention. In practice, we implement this degradation by adaptively masking the attention weights of the most influential image tokens in selected transformer layers, which corrupts visual-language perception. Such degradation-induced decoding emphasizes the perception of visual contexts and therefore significantly reduces language bias without harming the ability of VLMs. We conduct comprehensive experiments, and the results demonstrate the advantages of CMG, which requires neither additional conditions nor training costs. We also show quantitatively that CMG improves the performance of different VLMs on hallucination-specific benchmarks and generalizes effectively.
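
The two steps described in the abstract, degrading attention to the most influential image tokens and then contrasting the two output distributions, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the top-k selection rule, the guidance weight `alpha`, and the linear combination `(1 + alpha) * original - alpha * degraded` (the usual contrastive-decoding form) are all assumptions, since the abstract does not give the exact formula or the layer-selection procedure.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mask_top_image_tokens(attn_row, image_positions, k):
    """Zero the k image tokens with the largest attention weight, then renormalize.

    Assumption: 'most influential' is read as 'largest attention weight'.
    Renormalizing after zeroing gives the same weights as setting those
    positions to -inf before the softmax.
    """
    top = set(sorted(image_positions, key=lambda i: attn_row[i], reverse=True)[:k])
    masked = [0.0 if i in top else w for i, w in enumerate(attn_row)]
    total = sum(masked)
    return [w / total for w in masked] if total > 0 else masked

def guided_logits(original, degraded, alpha=1.0):
    """Push the output away from the degraded (language-biased) prediction.

    Assumed form: (1 + alpha) * original - alpha * degraded.
    """
    return [(1 + alpha) * o - alpha * d for o, d in zip(original, degraded)]

# Toy attention row over 4 context tokens; positions 1 and 2 are image tokens.
attn_row = [0.1, 0.4, 0.3, 0.2]
masked_row = mask_top_image_tokens(attn_row, image_positions=[1, 2], k=1)

# Toy logits over a 2-word vocabulary (made-up numbers for illustration).
# Index 0 is the visually grounded answer, index 1 the language-prior answer.
original = [1.0, 1.2]   # the original model slightly prefers the biased answer
degraded = [0.2, 1.5]   # with image attention degraded, the bias is stronger
guided = guided_logits(original, degraded, alpha=1.0)
probs = softmax(guided)
best = max(range(len(probs)), key=lambda i: probs[i])
```

In this toy case the original logits favor index 1, while the guided logits favor index 0, because the degraded model's stronger preference for the language-prior answer is subtracted out. In a real VLM the masking would be applied inside selected decoder layers during a second forward pass, which is not modeled here.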

Paper Structure

This paper contains 27 sections, 13 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: An illustration of hallucinations induced by language bias. (a) Examples of hallucinations induced by language bias in VLMs. The blue words are hallucinated content. (b) Accuracy of LLaVA-v1.5-7B on the MMMU benchmark [yue2024mmmu]. 'None*' denotes that the image is removed from the visual question input.
  • Figure 2: Architecture of Cross-Modal Guidance. CMG uses a perturbed self-attention map to amplify language priors in the underlying decoder-only transformer backbone. The original self-attention uses a causal mask, while the perturbed self-attention map replaces it with a dynamic mask that varies across samples. Perturbed self-attention is applied to several dynamically selected decoder layers. CMG contrasts the two output distributions to correct hallucinations in the original outputs.
  • Figure 3: Visualization of Attention Weight Changes Across Transformer Layers. Overall, the ratio of attention weight on image tokens decreases as the transformer layer index increases.
  • Figure 4: Variation in Attention Weight Proportions Across Token Sequence Parts. (a) The proportion of image attention weights as a function of transformer layer. (b) The proportion of image attention weights as the generated token sequence lengthens.
  • Figure 5: Self Attention Weights with Causal Mask
  • ...and 5 more figures