Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering
Kuicai Dong, Yujing Chang, Shijie Huang, Yasheng Wang, Ruiming Tang, Yong Liu
TL;DR
MMDocRAG introduces a large multimodal DocVQA/RAG benchmark comprising 4,055 expert-annotated QA pairs with cross-page, multimodal evidence chains, along with novel metrics for quote selection and interleaved multimodal answer generation. The framework enables evaluation of retrieval, evidence selection, and the integration of text with images through interleaved outputs, using both gold and noisy quotes to stress-test systems. Extensive experiments across 60 models and 14 retrievers reveal persistent challenges in multimodal evidence retrieval and integration, with proprietary VLMs generally outperforming open-source options and fine-tuning plus high-quality image descriptions yielding notable gains. The results show that while multimodal inputs can help, robust multimodal DocVQA remains difficult. This motivates future work on better evidence selection, grounding, and cross-modal reasoning, with MMDocRAG serving as a rigorous testing ground for progress.
Abstract
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that interleave text with relevant visual elements. Through large-scale experiments with 60 VLMs/LLMs and 14 retrieval systems, we identify persistent challenges in multimodal evidence retrieval, selection, and integration. Key findings reveal that advanced proprietary VLMs outperform open-source alternatives. Proprietary models also gain a moderate advantage from multimodal inputs over text-only inputs, whereas open-source alternatives show significant performance degradation. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems. Our benchmark and code are available at https://mmdocrag.github.io/MMDocRAG/.
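
To make the idea of evaluating multimodal quote selection concrete, the following is a minimal sketch. It assumes that quote selection is scored as set-overlap precision, recall, and F1 between the quote identifiers a model selects and the gold identifiers. The abstract does not define the benchmark's actual metrics, so this scoring rule, the function name `quote_selection_f1`, and the quote identifiers are all hypothetical.

```python
from typing import Dict, Iterable


def quote_selection_f1(selected: Iterable[str], gold: Iterable[str]) -> Dict[str, float]:
    """Set-overlap precision, recall, and F1 over quote identifiers.

    Assumption: each quote (text passage or image) has a unique string ID.
    This is an illustrative scoring rule, not the benchmark's own definition.
    """
    selected_set, gold_set = set(selected), set(gold)
    true_positives = len(selected_set & gold_set)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(selected_set) if selected_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: the model picks two of the three gold quotes
# plus one distractor (a "noisy" quote).
scores = quote_selection_f1(
    selected=["text_3", "image_1", "text_9"],
    gold=["image_1", "image_2", "text_3"],
)
print(scores)
```

Treating text and image quotes as entries in one identifier set keeps the sketch modality-agnostic; a per-modality breakdown could be obtained by filtering the identifiers before scoring.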
