CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation

Yexing Du, Kaiyuan Liu, Youcheng Pan, Zheng Chu, Bo Yang, Xiaocheng Feng, Ming Liu, Yang Xiang

TL;DR

CCFQA addresses the lack of multilingual, cross-modal factuality evaluation for Multimodal LLMs (MLLMs) by introducing a fully parallel benchmark spanning 8 languages and 14,400 speech-text QA items across four tasks: QA, XQA, SQA, and XSQA. The data is built in two phases (cross-lingual and cross-modal), with quality controls at each step (translation, human recheck, and ASR-based refinement), so that results can be compared directly across languages and modalities. The authors also propose an English-pivot few-shot transfer strategy that uses the model's stronger English QA ability to improve multilingual SQA, achieving performance competitive with GPT-4o-mini-Audio using only 5-shot training. Experiments reveal substantial cross-lingual and cross-modal inconsistencies in current MLLMs and show that English-bridged transfer is a promising way to improve factual consistency in multilingual spoken QA. The dataset and code are released for community use.

Abstract

As Large Language Models (LLMs) are increasingly adopted across the multilingual world, ensuring hallucination-free factuality becomes crucial. However, existing benchmarks for evaluating the reliability of Multimodal Large Language Models (MLLMs) focus predominantly on the textual or visual modalities and primarily on English, which leaves a gap in evaluating multilingual input, especially speech. To bridge this gap, we propose a novel Cross-lingual and Cross-modal Factuality benchmark (CCFQA). Specifically, the CCFQA benchmark contains parallel speech-text factual questions across 8 languages, designed to systematically evaluate MLLMs' cross-lingual and cross-modal factuality capabilities. Our experimental results demonstrate that current MLLMs still face substantial challenges on the CCFQA benchmark. Furthermore, we propose a few-shot transfer learning strategy that effectively transfers the Question Answering (QA) capabilities of LLMs in English to multilingual Spoken Question Answering (SQA) tasks, achieving performance competitive with GPT-4o-mini-Audio using just 5-shot training. We release CCFQA as a foundational research resource to promote the development of MLLMs with more robust and reliable speech understanding capabilities. Our code and dataset are available at https://github.com/yxduir/ccfqa.
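
To illustrate why a fully parallel benchmark is useful, the following is a minimal sketch of a consistency check over parallel questions. It is not the paper's metric or released code: the function name `consistency_rate`, the tuple layout, and the language codes `"en"` and `"zh"` are hypothetical placeholders, and the correctness flags are assumed to come from some external judging step. Because every question exists in every language and modality, the judgements for one question can be grouped and compared directly.

```python
from collections import defaultdict


def consistency_rate(judgements):
    """Fraction of questions whose correctness is identical in every setting.

    `judgements` is an iterable of (question_id, language, modality, is_correct)
    tuples. This layout is an assumption made for illustration only.
    """
    by_question = defaultdict(list)
    for question_id, _language, _modality, is_correct in judgements:
        by_question[question_id].append(bool(is_correct))

    if not by_question:
        return 0.0

    # A question is consistent when all of its settings share one outcome
    # (all correct or all incorrect).
    consistent = sum(1 for flags in by_question.values() if len(set(flags)) == 1)
    return consistent / len(by_question)


# Toy data: q1 is answered correctly in all settings, q2 is not.
toy = [
    ("q1", "en", "text", True),
    ("q1", "zh", "text", True),
    ("q1", "en", "speech", True),
    ("q2", "en", "text", True),
    ("q2", "zh", "speech", False),
]
rate = consistency_rate(toy)
print(rate)
```

Grouping by question rather than by language keeps the check independent of how many languages or modalities are present, which is why the parallel construction matters: without aligned questions there would be nothing to group on.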

Paper Structure

This paper contains 45 sections, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Factual Inconsistency in MLLMs. (a) Cross-lingual Inconsistency: inconsistent answers for the questions across different languages; (b) Cross-modal Inconsistency: inconsistent answers for the questions across different modalities; (c) Cross-lingual & Cross-modal Inconsistency.
  • Figure 2: CCFQA Dataset Construction: (a) Cross-Lingual Data Construction, (b) Cross-Modal Data Construction.
  • Figure 3: The Question Categories in the CCFQA.
  • Figure 4: The Architecture of LLM-SQA.
  • Figure 5: Performance Across Categories on XSQA Task.
  • ...and 2 more figures