Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework

Yuming Yang, Jiang Zhong, Li Jin, Jingwang Huang, Jingpeng Gao, Qing Liu, Yang Bai, Jingyuan Zhang, Rui Jiang, Kaiwen Wei

TL;DR

This work addresses the limitation of current MRAG benchmarks by introducing Chart-based MRAG and CHARGE, a semi-automatic framework to generate high-quality chart-related evaluation data, resulting in Chart-MRAG Bench with 4,738 QA pairs across 8 domains. It demonstrates that existing vision-language models struggle to effectively utilize retrieved visual knowledge in chart contexts, with 58.19% Correctness and 73.87% Coverage even under ground-truth retrieval, and reveals a prevalent text-over-visual bias in multimodal reasoning. The contributions include a comprehensive benchmark and data-collection framework, empirical insights into the limits of both open-source and proprietary LVLMs, and guidance for enhancing visual knowledge retrieval and source selection. The work advances practical understanding of visually grounded reasoning and offers resources for robust evaluation of vision-centric RAG systems in real-world document analysis.

Abstract

Multimodal Retrieval-Augmented Generation (MRAG) enhances reasoning capabilities by integrating external knowledge. However, existing benchmarks primarily focus on simple image-text interactions, overlooking complex visual formats like charts that are prevalent in real-world applications. In this work, we introduce a novel task, Chart-based MRAG, to address this limitation. To semi-automatically generate high-quality evaluation samples, we propose CHARt-based document question-answering GEneration (CHARGE), a framework that produces evaluation data through structured keypoint extraction, crossmodal verification, and keypoint-based generation. By combining CHARGE with expert validation, we construct Chart-MRAG Bench, a comprehensive benchmark for chart-based MRAG evaluation, featuring 4,738 question-answering pairs across 8 domains from real-world documents. Our evaluation reveals three critical limitations in current approaches: (1) unified multimodal embedding retrieval methods struggle in chart-based scenarios, (2) even with ground-truth retrieval, state-of-the-art MLLMs achieve only 58.19% Correctness and 73.87% Coverage scores, and (3) MLLMs demonstrate a consistent text-over-visual modality bias during Chart-based MRAG reasoning. CHARGE and Chart-MRAG Bench are released at https://github.com/Nomothings/CHARGE.git.

Paper Structure

This paper contains 35 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Example scenarios from MRAG-Bench. Previous benchmarks [chang2022webqamultihopmultimodalqa, encvqa, chen2023infoseek] mainly focused on retrieving from textual knowledge. However, there are scenarios where retrieving correct textual knowledge is hard and sometimes not as useful as visual knowledge.
  • Figure 2: Key statistics of MRAG-Bench.
  • Figure 3: Qualitative examples on MRAG-Bench. For each scenario, we show the results of GPT-4o [gpt4], Gemini Pro [team2023gemini], LLaVA-Next-Interleave [li2024llavanextinterleavetacklingmultiimagevideo], and Mantis-8B-Siglip [jiang2024mantis]. The ground-truth answer is in blue.
  • Figure 4: Qualitative example in which a proprietary model (Gemini Pro) identifies and utilizes the correct examples, while an open-source model (LLaVA-Next-Interleave) is misled by noisy retrieved information, resulting in incorrect answers.
  • Figure 5: Left: LLaVA-Next-Interleave results with 4 different multimodal retrievers. Its performance using retrieved images correlates 95% with the retriever's Recall@5 scores. Right: Average results of three random-seed runs. Increasing the number of ground-truth RAG examples steadily improves the model's performance, which reaches its maximum with 10 examples.
  • ...and 2 more figures