Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework
Yuming Yang, Jiang Zhong, Li Jin, Jingwang Huang, Jingpeng Gao, Qing Liu, Yang Bai, Jingyuan Zhang, Rui Jiang, Kaiwen Wei
TL;DR
Current MRAG benchmarks largely ignore charts. This work introduces the Chart-based MRAG task and CHARGE, a semi-automatic framework for generating high-quality chart-related evaluation data, and uses them to build Chart-MRAG Bench, which contains 4,738 QA pairs across 8 domains. Experiments show that existing vision-language models struggle to use retrieved visual knowledge in chart contexts: even under ground-truth retrieval they reach only 58.19% Correctness and 73.87% Coverage, and they exhibit a prevalent text-over-visual bias in multimodal reasoning. The contributions are a comprehensive benchmark and data-collection framework, empirical insights into the limits of both open-source and proprietary LVLMs, and guidance for improving visual knowledge retrieval and source selection. Together, these provide resources for robust evaluation of vision-centric RAG systems in real-world document analysis.
Abstract
Multimodal Retrieval-Augmented Generation (MRAG) enhances reasoning capabilities by integrating external knowledge. However, existing benchmarks primarily focus on simple image-text interactions, overlooking complex visual formats like charts that are prevalent in real-world applications. In this work, we introduce a novel task, Chart-based MRAG, to address this limitation. To semi-automatically generate high-quality evaluation samples, we propose CHARt-based document question-answering GEneration (CHARGE), a framework that produces evaluation data through structured keypoint extraction, crossmodal verification, and keypoint-based generation. By combining CHARGE with expert validation, we construct Chart-MRAG Bench, a comprehensive benchmark for chart-based MRAG evaluation, featuring 4,738 question-answering pairs across 8 domains from real-world documents. Our evaluation reveals three critical limitations in current approaches: (1) unified multimodal embedding retrieval methods struggle in chart-based scenarios, (2) even with ground-truth retrieval, state-of-the-art MLLMs achieve only 58.19% Correctness and 73.87% Coverage scores, and (3) MLLMs demonstrate a consistent text-over-visual modality bias during Chart-based MRAG reasoning. CHARGE and Chart-MRAG Bench are released at https://github.com/Nomothings/CHARGE.git.
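The abstract reports Correctness and Coverage scores without defining them here. As a purely illustrative sketch, the code below assumes both are keypoint-based: Coverage is taken to be the fraction of reference keypoints that appear in a model response, and Correctness the fraction of response keypoints that match a reference keypoint. These definitions, the function names, the exact-match comparison, and the example keypoints are all assumptions made for illustration and may differ from the metrics actually used in Chart-MRAG Bench.

```python
from typing import Iterable, Set


def _normalize(keypoints: Iterable[str]) -> Set[str]:
    # Lowercase and collapse whitespace so trivially different strings match.
    # Exact string matching is a simplification assumed for this sketch.
    return {" ".join(k.lower().split()) for k in keypoints}


def coverage(reference: Iterable[str], response: Iterable[str]) -> float:
    # Assumed definition: share of reference keypoints found in the response.
    ref, resp = _normalize(reference), _normalize(response)
    return len(ref & resp) / len(ref) if ref else 0.0


def correctness(reference: Iterable[str], response: Iterable[str]) -> float:
    # Assumed definition: share of response keypoints supported by the reference.
    ref, resp = _normalize(reference), _normalize(response)
    return len(ref & resp) / len(resp) if resp else 0.0


# Made-up keypoints, not taken from the benchmark.
reference_keypoints = [
    "Revenue rose 12% in 2020",
    "Asia had the largest share",
]
response_keypoints = [
    "revenue rose 12% in 2020",
    "Europe had the largest share",
    "Costs fell",
]

cov = coverage(reference_keypoints, response_keypoints)
cor = correctness(reference_keypoints, response_keypoints)
```

Under these assumed definitions, a response can score well on one metric and poorly on the other: a response that states only a single verified keypoint would have high Correctness but low Coverage.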
