PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering
Yihao Ding, Kaixuan Ren, Jiabin Huang, Siwen Luo, Soyeon Caren Han
TL;DR
This work tackles retrieval-based multimodal question answering over text-dense, multi-page visually rich documents (VRDs) by introducing PDF-MVQA, a large dataset derived from PubMed Central with rich document-semantic entity annotations. It proposes a family of Multimodal Multi-page Retrievers: RoI-based, image-patch-based, and a Joint-grained variant that augments coarse entity representations with fine-grained token content using long-sequence models. Empirical results show that patch-based models excel on exact and partial matching, while Joint-grained Retrieval yields the strongest overall and recall performance, especially on complex sections and longer documents; experiments with real-world OCR pipelines further show robustness gains. The work advances cross-page VRD-QA by combining the implicit knowledge of vision-language pre-trained models (VLPMs) with explicit document-layout semantics, enabling more reliable retrieval of paragraphs, tables, and figures across entire documents, with practical implications for literature review and knowledge extraction.
Abstract
Document Question Answering (QA) presents a challenge in understanding visually rich documents (VRDs), particularly those dominated by lengthy textual content, such as research journal articles. Existing studies primarily focus on real-world documents with sparse text, while challenges persist in comprehending the hierarchical semantic relations among multiple pages to locate multimodal components. To address this gap, we propose PDF-MVQA, which is tailored to research journal articles and covers multi-page, multimodal information retrieval. Unlike traditional machine reading comprehension (MRC) tasks, our approach aims to retrieve entire paragraphs containing answers, or visually rich document entities such as tables and figures. Our contributions include a comprehensive PDF document VQA dataset that allows the examination of semantically hierarchical layout structures in text-dominant documents. We also present new VRD-QA frameworks designed to capture textual content and the relations among document layout elements simultaneously, extending page-level understanding to the entire multi-page document. Through this work, we aim to enhance the ability of existing vision-and-language models to handle the challenges posed by text-dominant documents in VRD-QA.
