Multimodal Fact-Level Attribution for Verifiable Reasoning

David Wan, Han Wang, Ziyang Wang, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal

TL;DR

MuRGAt introduces a rigorous benchmark and automatic evaluation pipeline for fact-level multimodal attribution in complex reasoning tasks spanning video, audio, and graphs. By decomposing evaluation into verifiable claim identification, atomic fact decomposition, and attribution quality, and by defining MuRGAt-Score as a coverage-weighted attribution metric, the work demonstrates that state-of-the-art MLLMs often excel at answering questions but struggle to provide faithful, precise citations. Automatic metrics show strong correlation with human judgments, enabling scalable benchmarking. Across experiments, increased reasoning depth can degrade grounding fidelity, while programmatic grounding improves attribution at the cost of some QA accuracy, highlighting a fundamental trade-off between reasoning and verifiable attribution. Overall, MuRGAt provides a path toward multimodal models that are both correct and verifiably grounded in heterogeneous input sources.
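The summary above does not give the formula behind MuRGAt-Score, so the following is only a minimal sketch of one way to read "coverage-weighted": attribution quality measured over cited facts, scaled by the fraction of verifiable facts that carry a citation at all. The record type `FactJudgment`, its fields, and the multiplicative combination are assumptions made for illustration, not the paper's definition.

```python
from dataclasses import dataclass


@dataclass
class FactJudgment:
    """One atomic fact from a response (hypothetical record, not the paper's schema)."""
    text: str
    has_citation: bool  # did the model attach any citation to this fact?
    supported: bool     # does the cited evidence actually support the fact?


def coverage(facts: list[FactJudgment]) -> float:
    """Fraction of verifiable atomic facts that carry at least one citation."""
    if not facts:
        return 0.0
    return sum(f.has_citation for f in facts) / len(facts)


def attribution_quality(facts: list[FactJudgment]) -> float:
    """Fraction of cited facts whose citation supports them."""
    cited = [f for f in facts if f.has_citation]
    if not cited:
        return 0.0
    return sum(f.supported for f in cited) / len(cited)


def coverage_weighted_score(facts: list[FactJudgment]) -> float:
    """ASSUMED combination: attribution quality scaled by coverage."""
    return coverage(facts) * attribution_quality(facts)


# Illustrative judgments only; these are not results from the paper.
example_facts = [
    FactJudgment("The speaker enters at the start.", True, True),
    FactJudgment("A bell rings twice.", True, True),
    FactJudgment("The bell rings before the speaker sits.", True, False),
    FactJudgment("Therefore the meeting started late.", False, False),
]
example_score = coverage_weighted_score(example_facts)
```

Under this reading, a response that cites few of its facts is penalized even when every citation it does give is correct, which matches the summary's point that answering well and citing faithfully are separate abilities.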

Abstract

Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade-off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.
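To make the citation requirement concrete, here is a minimal sketch of a citation record that carries both a modality and a temporal segment, as the abstract describes. The class name `Citation`, its field names, and the use of seconds are assumptions; the abstract does not specify a citation format. The `temporal_iou` helper is a generic overlap measure included only to show how two such segments could be compared; it is not presented as the paper's attribution-quality test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation: which modality, and which time span within it."""
    modality: str   # e.g. "video" or "audio"
    start_s: float  # segment start, in seconds (assumed unit)
    end_s: float    # segment end, in seconds (assumed unit)

    def __post_init__(self):
        if self.end_s < self.start_s:
            raise ValueError("segment end precedes start")


def temporal_iou(a: Citation, b: Citation) -> float:
    """Intersection-over-union of two segments; 0.0 when modalities differ."""
    if a.modality != b.modality:
        return 0.0
    inter = max(0.0, min(a.end_s, b.end_s) - max(a.start_s, b.start_s))
    union = (a.end_s - a.start_s) + (b.end_s - b.start_s) - inter
    return inter / union if union > 0 else 0.0


# Illustrative values only.
cited = Citation("video", 10.0, 20.0)
reference = Citation("video", 15.0, 25.0)
overlap = temporal_iou(cited, reference)
```

Keeping modality and time span in one record means a citation to the right moment in the wrong stream (for example, audio instead of video) scores zero overlap rather than being silently accepted.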

Paper Structure

This paper contains 46 sections, 5 equations, 19 figures, and 13 tables.

Figures (19)

  • Figure 1: Overview of MuRGAt and the evaluation protocol. The model is given a question and multimodal sources and is asked to generate a response containing explicit reasoning and precise citations, including the specific modality and timestamp. To evaluate the response, we apply a fact-level multimodal attribution protocol. The generated response and its citations are processed through three subtasks: (1) verifiable claim identification, (2) atomic fact decomposition, and (3) attribution quality.
  • Figure 2: Gemini models' performance with different thinking levels.
  • Figure 3: Gemini-3-Flash results with program-aided generation on WorldSense.
  • Figure 4: Annotation UI for Attribution.
  • Figure 5: Comparative analysis of Gemini 2.5 Flash and Gemini 3 Pro. While Pro attempts higher-level narrative synthesis (e.g., spatial layouts and song titles), it suffers from lower grounding precision compared to Flash's minimalist, observation-first approach.
  • ...and 14 more figures