MCiteBench: A Multimodal Benchmark for Generating Text with Citations

Caiyu Hu, Yikai Zhang, Tinghui Zhu, Yiwei Ye, Yanghua Xiao

TL;DR

MCiteBench introduces a multimodal benchmark to evaluate how well multimodal large language models generate text with citations grounded in multimodal evidence. By constructing an attribution corpus from academic papers and review-rebuttal interactions, and by creating QA pairs that require citing diverse evidence, the authors assess models along citation quality, source reliability, and answer accuracy using a judge-based evaluation pipeline. Experimental results reveal that current models struggle to ground outputs across modalities and exhibit a modality bias toward textual evidence, with multi-source scenarios offering partial credit but complicating exact-match attribution. The work highlights the need for stronger multimodal grounding and provides a concrete dataset and evaluation framework to guide future development of faithful, verifiable multimodal generation.

Abstract

Multimodal Large Language Models (MLLMs) have advanced in integrating diverse modalities but frequently suffer from hallucination. A promising solution to mitigate this issue is to generate text with citations, providing a transparent chain for verification. However, existing work primarily focuses on generating citations for text-only content, leaving the challenges of multimodal scenarios largely unexplored. In this paper, we introduce MCiteBench, the first benchmark designed to assess the ability of MLLMs to generate text with citations in multimodal contexts. Our benchmark comprises data derived from academic papers and review-rebuttal interactions, featuring diverse information sources and multimodal content. Experimental results reveal that MLLMs struggle to ground their outputs reliably when handling multimodal input. Further analysis uncovers a systematic modality bias and reveals how models internally rely on different sources when generating citations, offering insights into model behavior and guiding future directions for multimodal citation tasks.

Paper Structure

This paper contains 45 sections, 3 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Illustration of the task form in MCiteBench. The model takes a multimodal corpus as input and generates responses with explicit citations.
  • Figure 2: The construction pipeline of MCiteBench. Initially, we collect multimodal academic papers along with their corresponding review-rebuttal interactions and then parse the papers to extract candidate evidence. GPT-4o is used to extract explanation QA pairs from the comments and generate locating QA pairs. Next, human annotators match the references in the answers to the relevant content in the original papers. Finally, the data filtered and labeled by the model is manually verified by human annotators to ensure consistency and accuracy.
  • Figure 3: The calculation of Citation F1.
  • Figure 4: The calculation of Source F1 and Source Exact Match.
  • Figure 5: Source Exact Match score of models on the MCiteBench benchmark across different modalities, under the multi-source explanation setting with two gold evidence items per question.
  • ...and 5 more figures
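
The captions for Figures 3 and 4 name the source-level metrics but the listing above does not reproduce their formulas. As a rough illustration only, the sketch below assumes that Source F1 is the standard set-based F1 between the evidence items a model cites and the gold evidence items, and that Source Exact Match is 1 only when the two sets are identical. The function name `source_scores` and the evidence labels are hypothetical, and the paper's actual definitions may differ.

```python
from typing import Dict, Iterable


def source_scores(predicted: Iterable[str], gold: Iterable[str]) -> Dict[str, float]:
    """Set-based precision, recall, F1, and exact match over cited sources.

    Assumption: this is the standard set-overlap formulation, not
    necessarily the exact computation used by MCiteBench.
    """
    pred_set, gold_set = set(predicted), set(gold)
    overlap = len(pred_set & gold_set)

    # Guard against empty sets to avoid division by zero.
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0

    # Exact match gives no partial credit: the sets must coincide.
    exact_match = 1.0 if pred_set == gold_set else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "exact_match": exact_match,
    }


# Hypothetical multi-source question with two gold evidence items,
# where the model cites only one of them.
scores = source_scores(predicted=["Figure 2"], gold=["Figure 2", "Table 1"])
print(scores)
```

Under these assumptions, a model that cites one of two gold evidence items receives partial credit on F1 (precision 1.0, recall 0.5) but scores 0 on exact match, which is consistent with the TL;DR's remark that multi-source scenarios offer partial credit while making exact-match attribution harder.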