MRAMG-Bench: A Comprehensive Benchmark for Advancing Multimodal Retrieval-Augmented Multimodal Generation

Qinhan Yu, Zhiyou Xiao, Binghui Li, Zhengren Wang, Chong Chen, Wentao Zhang

TL;DR

MRAMG-Bench addresses the lack of evaluation resources for Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) by introducing a benchmark of six datasets spanning three domains: Web, Academia, and Lifestyle. The paper formalizes the MRAMG task, presents a three-stage dataset construction pipeline, and proposes a flexible generation framework with LLM-based, MLLM-based, and rule-based strategies. It provides an evaluation suite that combines statistical and LLM-based metrics and reports experiments over 11 generative models, revealing substantial gaps in current multimodal generation, particularly in image selection and insertion order in complex multi-image scenarios. The benchmark offers a practical platform for driving progress in end-to-end multimodal reasoning and answer generation.

Abstract

Recent advances in Retrieval-Augmented Generation (RAG) have significantly improved response accuracy and relevance by incorporating external knowledge into Large Language Models (LLMs). However, existing RAG methods primarily focus on generating text-only answers, even in Multimodal Retrieval-Augmented Generation (MRAG) scenarios, where multimodal elements are retrieved to assist in generating text answers. To address this, we introduce the Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task, in which we aim to generate multimodal answers that combine both text and images, fully leveraging the multimodal data within a corpus. Despite growing attention to this challenging task, a notable lack of a comprehensive benchmark persists for effectively evaluating its performance. To bridge this gap, we provide MRAMG-Bench, a meticulously curated, human-annotated benchmark comprising 4,346 documents, 14,190 images, and 4,800 QA pairs, distributed across six distinct datasets and spanning three domains: Web, Academia, and Lifestyle. The datasets incorporate diverse difficulty levels and complex multi-image scenarios, providing a robust foundation for evaluating the MRAMG task. To facilitate rigorous evaluation, MRAMG-Bench incorporates a comprehensive suite of both statistical and LLM-based metrics, enabling a thorough analysis of the performance of generative models in the MRAMG task. Additionally, we propose an efficient and flexible multimodal answer generation framework that can leverage LLMs/MLLMs to generate multimodal responses. Our datasets and complete evaluation results for 11 popular generative models are available at https://github.com/MRAMG-Bench/MRAMG.

Paper Structure

This paper contains 37 sections, 2 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Illustration of the MRAMG task (above), with scenarios below showing how integrating text and images enhances clarity and understanding.
  • Figure 2: The MRAMG-Bench construction pipeline consists of three stages: (1) Data Selection and Preprocessing, where data is collected, cleaned, and preprocessed; (2) QA Generation and Refinement, involving the formulation and refinement of QA pairs; (3) Data Quality Check, where annotators and experts conduct a three-stage review to ensure high-quality benchmarks.
  • Figure 3: Comparison of the generative performance of two matching algorithms on different datasets.