CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention

Zekai Ye, Qiming Li, Xiaocheng Feng, Libo Qin, Yichong Huang, Baohang Li, Kui Jiang, Yang Xiang, Zhirui Zhang, Yunfei Lu, Duyu Tang, Dandan Tu, Bing Qin

TL;DR

Multilingual object hallucination limits LVLM usefulness across languages. CLAIM introduces a near training-free, inference-time intervention that identifies language-specific cross-modal attention heads, estimates language-shift vectors from English to target languages, and applies shifts to align non-English visual perception with English. Empirical results on POPE and MME show substantial improvements in both object- and attribute-level hallucinations and demonstrate cross-language generalization, with attention divergence concentrated in intermediate layers. The work illuminates multilingual inference pathways in LVLMs and offers a low-overhead mechanism for improving cross-lingual visual grounding.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal abilities but remain prone to multilingual object hallucination: they are more likely to generate responses inconsistent with the visual input when queried in non-English languages than in English. Most existing approaches to this problem rely on pretraining or fine-tuning, which are resource-intensive. In this paper, inspired by observed disparities in cross-modal attention patterns across languages, we propose Cross-Lingual Attention Intervention for Mitigating multilingual object hallucination (CLAIM) in LVLMs, a novel near training-free method that aligns attention patterns. CLAIM first identifies language-specific cross-modal attention heads, then estimates language shift vectors from English to the target language, and finally intervenes in the attention outputs during inference to align cross-lingual visual perception capability. Extensive experiments demonstrate that CLAIM achieves an average improvement of 13.56% (up to 30% in Spanish) on POPE and 21.75% on the hallucination subsets of the MME benchmark across various languages. Further analysis reveals that multilingual attention divergence is most prominent in intermediate layers, highlighting their critical role in multilingual scenarios.
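
The last two steps named in the abstract (estimating a language shift vector and intervening on attention outputs) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the formulas, so the sketch assumes the shift is the difference between mean head outputs for the target language and for English on the same images, and that the intervention subtracts a scaled shift. The function names and the `alpha` strength parameter are hypothetical.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def estimate_shift(english_outputs, target_outputs):
    """Assumed shift vector for one attention head: mean target-language
    output minus mean English output (direction: English -> target)."""
    en_mean = mean_vector(english_outputs)
    tgt_mean = mean_vector(target_outputs)
    return [t - e for t, e in zip(tgt_mean, en_mean)]


def intervene(head_output, shift, alpha=1.0):
    """Assumed intervention: move a target-language head output back
    toward English by subtracting alpha times the shift.
    `alpha` is a hypothetical strength parameter."""
    return [h - alpha * s for h, s in zip(head_output, shift)]


# Toy head outputs (2-dimensional) for the same two images in each language.
english = [[1.0, 0.0], [3.0, 2.0]]
target = [[2.0, 2.0], [4.0, 4.0]]

shift = estimate_shift(english, target)
corrected = intervene([2.0, 2.0], shift, alpha=0.5)
print(shift, corrected)
```

In a real model the vectors would be per-head attention outputs collected from forward passes, and the intervention would be applied only to the heads selected in the identification step.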

Paper Structure

This paper contains 34 sections, 8 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: A comparison of attention weight maps for a Chinese and an English query. For the English query, the LVLM correctly focuses on the key object "bird" in the image, leading to an accurate response. For the Chinese query, however, the model hallucinates.
  • Figure 2: Overview of our proposed CLAIM method. Each MHA block in the figure represents an attention head. CLAIM intervenes in the identified language-specific cross-modal attention heads using estimated language shift vectors. (1) Identification of Language-Specific Cross-Modal Attention Heads: We train probes to identify the language-specific cross-modal attention heads, whose behavior associated with visual perception differs significantly across languages. (2) Estimation of Language Shift Vectors: We estimate the language shift vectors in attention outputs from English to the target non-English language for identical images queried with captions. (3) Intervention during Inference: During inference, we apply the language shift vectors to intervene in the language-specific cross-modal attention heads, mitigating multilingual object hallucination.
  • Figure 3: Average scores for LLaVA-1.5 across five languages on the MME full dataset.
  • Figure 4: Logit lens observation for interpreting LLaVA-1.5 in multilingual scenarios. The depth of the block color for the $i$-th token at layer $l$ indicates the magnitude of its contribution to the logits of the final predicted token. The color indicates whether the corresponding token belongs to the image or the text. The query means "Is there a car in the image?".
  • Figure 5: $A_b$ per layer of LLaVA-1.5 across four languages, and the per-layer average rate of change of $A_b$ for the non-English languages relative to English.
  • ...and 5 more figures
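
Step (1) in the Figure 2 caption, probing for language-specific heads, can be sketched as follows. The caption does not specify the probe type or the selection rule, so this sketch swaps in a nearest-centroid classifier as the probe and assumes that the heads whose outputs best separate English from the target language are the ones selected. All function names and the `top_k` parameter are hypothetical.

```python
from statistics import fmean


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def probe_accuracy(train_en, train_tgt, test_en, test_tgt):
    """Fit a nearest-centroid probe on one head's outputs and return its
    accuracy at telling English queries from target-language queries."""
    c_en, c_tgt = centroid(train_en), centroid(train_tgt)

    def predicts_english(vector):
        return sq_dist(vector, c_en) <= sq_dist(vector, c_tgt)

    correct = sum(predicts_english(v) for v in test_en)
    correct += sum(not predicts_english(v) for v in test_tgt)
    return correct / (len(test_en) + len(test_tgt))


def select_heads(accuracy_by_head, top_k):
    """Assumed selection rule: keep the top_k heads by probe accuracy."""
    ranked = sorted(accuracy_by_head, key=accuracy_by_head.get, reverse=True)
    return ranked[:top_k]


# Toy outputs for two heads, keyed by (layer, head index).
accuracy_by_head = {
    # Head whose outputs differ clearly between the two languages.
    (0, 0): probe_accuracy(
        [[0.0, 0.0], [0.0, 1.0]], [[5.0, 5.0], [5.0, 6.0]],
        [[0.0, 0.5]], [[5.0, 5.5]],
    ),
    # Head whose outputs look the same in both languages.
    (0, 1): probe_accuracy(
        [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [0.0, 0.0]],
        [[0.0, 0.0]], [[1.0, 1.0]],
    ),
}
selected = select_heads(accuracy_by_head, top_k=1)
print(accuracy_by_head, selected)
```

A head whose probe is near chance level carries little language-specific signal under this assumption, so it would be left untouched at inference time.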