When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance

Jinjin Cao; Zhiyang Chen; Zijun Wang; Liyuan Ma; Weijian Luo; Guojun Qi

TL;DR

This work tackles language bias-driven hallucinations in Vision-Language Models by introducing Cross-Modal Guidance (CMG), a training-free decoding technique that perturbs visual-language attention during inference. CMG constructs an Amateur Model by applying selective attention masks and compares its outputs with those of the original model, which strengthens reliance on visual context and reduces hallucinations without additional training. On the POPE, HallusionBench, and MME benchmarks, CMG consistently improves accuracy and performance on perception tasks, and it remains robust across model sizes. The approach offers practical, scalable mitigation of hallucinations, though it requires careful hyperparameter tuning and dynamic layer/attention masking to avoid unintended amplification of biases.

Abstract

Vision-Language Models (VLMs) have shown solid ability in multimodal understanding of both visual and language contexts. However, existing VLMs often suffer from severe hallucinations: they tend to generate responses that are fluent in language but irrelevant to the images given earlier in the context. To address this issue, we analyze how language bias contributes to hallucinations and then introduce Cross-Modal Guidance (CMG), a training-free decoding method that mitigates hallucinations by leveraging the difference between the output distributions of the original model and a version of it with degraded visual-language attention. In practice, as a concrete type of degradation, we adaptively mask the attention weights of the most influential image tokens in selected transformer layers to corrupt visual-language perception. This degradation-induced decoding emphasizes the perception of visual contexts and therefore significantly reduces language bias without harming the abilities of VLMs. In our experiments, we conduct comprehensive studies, and all results demonstrate the advantages of CMG, which requires neither additional conditions nor training costs. We also show quantitatively that CMG improves the performance of different VLMs on hallucination-specific benchmarks and generalizes effectively.
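The two ingredients described in the abstract, masking the most influential image tokens and contrasting the original output distribution with the degraded one, can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the function names, the fixed top-k selection, the renormalization after masking, and the contrastive-decoding-style combination with weight `alpha` are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mask_top_image_tokens(attn_row, image_positions, k):
    """Zero the k largest attention weights that fall on image tokens.

    attn_row: attention weights of one query over all keys (sums to 1).
    image_positions: indices of keys that are image tokens.
    Assumption: the remaining weights are renormalized to sum to 1,
    which matches masking the scores with -inf before the softmax.
    """
    top = set(sorted(image_positions, key=lambda i: attn_row[i], reverse=True)[:k])
    masked = [0.0 if i in top else w for i, w in enumerate(attn_row)]
    total = sum(masked)
    return [w / total for w in masked] if total > 0 else masked


def guided_logits(original, amateur, alpha=1.0):
    """Assumed combination rule: push the original logits away from the
    logits of the degraded ("amateur") model, with strength alpha."""
    return [(1 + alpha) * o - alpha * a for o, a in zip(original, amateur)]


# Toy example: keys 1 and 2 are image tokens; mask the strongest one.
attn = [0.1, 0.5, 0.3, 0.1]
degraded_attn = mask_top_image_tokens(attn, image_positions=[1, 2], k=1)

# Toy next-token logits from the original and the degraded model.
original = [2.0, 1.0, 0.5]
amateur = [1.0, 2.0, 0.5]
guided = guided_logits(original, amateur, alpha=1.0)
probs = softmax(guided)
```

In this toy setting, the token that the degraded model prefers (index 1, standing in for a language-prior guess) is suppressed, while the token that depends on the visual evidence (index 0) is boosted. In a real VLM the masking would be applied inside selected transformer layers and the choice of tokens would be adaptive, as the abstract states; the fixed `k` here is only a stand-in.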
