Piecing It All Together: Verifying Multi-Hop Multimodal Claims
Haoran Wang, Aman Rangapur, Xiongxiao Xu, Yueqing Liang, Haroon Gharwi, Carl Yang, Kai Shu
TL;DR
This work introduces MMCV, a large-scale dataset for multi-hop multimodal claim verification that requires integrating evidence from text, images, and tables. A three-stage pipeline (LLM-driven claim generation, refinement with Wikipedia context and RAG-based validation, and human annotation) produces 15,569 claims labeled SUPPORT or REFUTE, each requiring up to four reasoning hops. The authors benchmark state-of-the-art multimodal LLMs in open-book and closed-book settings, find calibration and reliability issues, and show that advanced prompting (including symbolic and programmatic reasoning) can improve performance, though human experts still outperform models on higher-hop claims. The dataset and evaluation framework are intended to support future research on cross-modal multi-hop fact verification.
Abstract
Existing claim verification datasets often do not require systems to perform complex reasoning or effectively interpret multimodal evidence. To address this, we introduce a new task: multi-hop multimodal claim verification. This task challenges models to reason over multiple pieces of evidence from diverse sources, including text, images, and tables, and determine whether the combined multimodal evidence supports or refutes a given claim. To study this task, we construct MMCV, a large-scale dataset comprising 15k multi-hop claims paired with multimodal evidence, generated and refined using large language models, with additional input from human feedback. We show that MMCV is challenging even for the latest state-of-the-art multimodal large language models, especially as the number of reasoning hops increases. Additionally, we establish a human performance benchmark on a subset of MMCV. We hope this dataset and its evaluation task will encourage future research in multimodal multi-hop claim verification.
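The abstract's observation that difficulty grows with the number of reasoning hops implies breaking accuracy down by hop count. The following is a minimal sketch of such a breakdown, not the authors' evaluation code. The `Claim` record, its field names, and the `accuracy_by_hops` function are hypothetical and are not taken from the MMCV release.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    # Hypothetical record layout; these field names are illustrative
    # and are not the MMCV schema.
    text: str
    label: str  # "SUPPORT" or "REFUTE"
    hops: int   # number of reasoning hops needed to verify the claim


def accuracy_by_hops(claims, predictions):
    """Return {hop_count: accuracy} for parallel lists of claims and predicted labels."""
    if len(claims) != len(predictions):
        raise ValueError("claims and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for claim, predicted in zip(claims, predictions):
        total[claim.hops] += 1
        correct[claim.hops] += int(predicted == claim.label)
    # Sort by hop count so the result reads from easiest to hardest.
    return {hops: correct[hops] / total[hops] for hops in sorted(total)}
```

Grouping by hop count rather than reporting a single aggregate score makes it possible to see whether a model's performance falls off as more pieces of evidence must be combined.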
