Mitigating Multilingual Hallucination in Large Vision-Language Models

Xiaoye Qu; Mingyang Song; Wei Wei; Jianfeng Dong; Yu Cheng

TL;DR

This work tackles multilingual hallucination in large vision-language models by uncovering dual causes: insufficient multilingual instruction understanding and lack of multilingual hallucination-aware data. It introduces a two-stage Multilingual Hallucination Removal framework that first strengthens multilingual instruction following via supervised fine-tuning and then builds hallucination-aware data through cross-lingual alignment to train with direct preference optimization. The approach automatically generates multilingual data without manual annotation and demonstrates substantial reductions in hallucinations and strong accuracy gains across 13 languages on POPE MUL, as well as improvements on MME MUL and AMBER MUL, with evidence of generality to CogVLM. Overall, MHR provides a scalable, language-agnostic strategy to enhance reliability of LVLM outputs in multilingual contexts, enabling broader practical deployment.

Abstract

While Large Vision-Language Models (LVLMs) have exhibited remarkable capabilities across a wide range of tasks, they suffer from hallucination, where the model generates plausible yet incorrect answers for the input image-query pair. This hallucination is even more severe when the image is queried in non-English languages, yet existing methods for mitigating hallucination in LVLMs consider only English scenarios. In this paper, we make the first attempt to mitigate multilingual hallucination in LVLMs. Through thorough experimental analysis, we find that multilingual hallucination in LVLMs is a systemic problem that can arise from deficiencies in multilingual capabilities or from inadequate multimodal abilities. To this end, we propose a two-stage Multilingual Hallucination Removal (MHR) framework for LVLMs, aiming to improve resistance to hallucination for both high-resource and low-resource languages. Instead of relying on intricate manual annotation of multilingual resources, we fully leverage the inherent capabilities of the LVLM and propose a novel cross-lingual alignment method, which generates multiple responses for each image-query input and then identifies hallucination-aware pairs for each language. These data pairs are finally used for direct preference optimization to prompt the LVLM to favor non-hallucinating responses. Experimental results show that MHR achieves a substantial reduction in hallucination generation for LVLMs. Notably, on our extended multilingual POPE benchmark, our framework delivers an average increase of 19.0% in accuracy across 13 different languages. Our code and model weights are available at https://github.com/ssmisya/MHR
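The pair-construction and preference-optimization steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `score` function, the max/min selection rule, the toy dictionary, and the `beta` value are all assumptions made for illustration. Only the form of the direct preference optimization (DPO) loss follows the standard published formulation.

```python
import math
from typing import Callable, List, Tuple


def select_pair(
    responses: List[str],
    en_good: str,
    en_bad: str,
    score: Callable[[str, str], float],
) -> Tuple[str, str]:
    """Pick a (chosen, rejected) pair from N sampled responses.

    `score(response, english_answer)` is a hypothetical cross-lingual
    similarity; a real system would need a multilingual model here.
    """
    # A positive margin means the response is closer to the
    # non-hallucinated English answer than to the hallucinated one.
    margins = [score(r, en_good) - score(r, en_bad) for r in responses]
    best = max(range(len(responses)), key=margins.__getitem__)
    worst = min(range(len(responses)), key=margins.__getitem__)
    return responses[best], responses[worst]


def dpo_loss(
    pol_chosen: float,
    pol_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    x = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(x)) written in a numerically stable form.
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Toy stand-in for a cross-lingual scorer (illustration only).
TOY_DICT = {"是的": "yes", "不是": "no"}


def toy_score(response: str, en_answer: str) -> float:
    return 1.0 if TOY_DICT.get(response) in en_answer.lower().split() else 0.0


chosen, rejected = select_pair(["是的", "不是", "是的"], "no", "yes", toy_score)
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In this toy run the response agreeing with the non-hallucinated English answer becomes the preferred sample, and the loss falls below log 2 once the policy prefers the chosen response more strongly than the reference model does.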

Paper Structure (26 sections, 3 equations, 7 figures, 7 tables)

Introduction
Related Work
Large Visual-Language Models
Hallucination in LVLMs
Multilingual Large Vision-Language Models
Method
Multilingual Supervised Fine-tuning
Construct Hallucination-Aware Data
Generating Non-English Responses
Cross-lingual Alignment
Construct Explicit Hallucination-aware Pairs
Construct Implicit Hallucination-aware Pairs
Multilingual Direct Preference Optimization
Experiment
Evaluation Metrics
...and 11 more sections

Figures (7)

Figure 1: Multilingual hallucinations in LLaVA-1.5. On the POPE MSCOCO benchmark, most languages have an accuracy under 70%, but English exceeds 85%.
Figure 2: Our Multilingual Hallucination Removal framework. First, the LVLM strengthens its ability to follow multilingual instructions through supervised fine-tuning. Then, based on an existing hallucination-aware dataset $D^h$, the LVLM generates $N$ responses for each language given the query in that language. Finally, these responses, together with the English hallucinated and non-hallucinated answers, are used to generate hallucination-aware pairs for direct preference optimization.
Figure 3: "unknown prop" refers to the ratio of invalid answers generated. After multilingual supervised fine-tuning, the unknown prop decreases significantly for all presented languages.
Figure 4: Performance on the full MME set, which consists of 14 tasks. Each graph displays the performance of LLaVA-1.5 and our MHR model. Here we present results in four languages (uk, zh, bg, and ru), as outlined in Table \ref{tab:mme}.
Figure 5: Comparison of base LLaVA-1.5, LLaVA-1.5 with SFT, and LLaVA-1.5 with direct DPO.
...and 2 more figures
