CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models

Yongheng Zhang, Xu Liu, Ruoxi Zhou, Qiguang Chen, Hao Fei, Wenpeng Lu, Libo Qin

TL;DR

CCHall is the first benchmark for evaluating joint cross-lingual and cross-modal hallucinations in large language models, addressing a gap left by prior work that treated the two settings separately. It combines raw multimodal data with cross-modal, cross-lingual, and joint hallucination data across VQA and image captioning tasks, with translations into languages at different resource levels and human verification. The paper reports that current multimodal LLMs (MLLMs) remain far from robust on CCHall, though methods such as UniHD with external tools and multilingual prompts provide meaningful gains. By releasing open data and code, CCHall aims to drive progress in reducing joint hallucinations and improving the reliability of multimodal, multilingual LLM systems in real-world deployments.

Abstract

Investigating hallucination issues in large language models (LLMs) within cross-lingual and cross-modal scenarios can greatly advance their large-scale deployment in real-world applications. Nevertheless, current studies are limited to a single scenario, either cross-lingual or cross-modal, leaving a gap in the exploration of hallucinations in joint cross-lingual and cross-modal scenarios. Motivated by this, we introduce a novel joint Cross-lingual and Cross-modal Hallucinations benchmark (CCHall) to fill this gap. Specifically, CCHall simultaneously incorporates both cross-lingual and cross-modal hallucination scenarios, so it can be used to assess the cross-lingual and cross-modal capabilities of LLMs. Furthermore, we conduct a comprehensive evaluation on CCHall, exploring both mainstream open-source and closed-source LLMs. The experimental results highlight that current LLMs still struggle with CCHall. We hope CCHall can serve as a valuable resource for assessing LLMs in joint cross-lingual and cross-modal scenarios.

Paper Structure

This paper contains 33 sections, 3 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: (a) Cross-lingual hallucination: "stand" is mistranslated as "站在"; the correct rendering here is "忍受". (b) Cross-modal hallucination: the response fabricates a "bridge". (c) Joint cross-lingual and cross-modal hallucination: the response fabricates "Oranges" (cross-modal) and does not use Chinese in its answer (cross-lingual).
  • Figure 3: The construction process of CCHall includes: (a) Raw Multi-modal Dataset Selection, (b) Cross-modal Hallucination Data Construction, (c) Cross-lingual Hallucination Data Construction, and (d) Cross-modal and Cross-lingual Hallucination Dataset.
  • Figure 4: Presentation of data in CCHall: (a) The diversity of multi-modal data as represented by CLIP-based (Radford et al., 2021) classification. (b) Display of part of the detailed topics in CCHall.
  • Figure 5: Visualization of the semantic feature coverage of all languages in CCHall, demonstrating the distribution and range of linguistic representations.
  • Figure 6: Analysis of the underlying causes of cross-lingual and cross-modal hallucinations in MLLMs.
  • ...and 12 more figures