Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines

Zi-Ao Ma, Tian Lan, Rong-Cheng Tu, Yong Hu, Yu-Shi Zhu, Tong Zhang, Heyan Huang, Zhijing Wu, Xian-Ling Mao

TL;DR

This work defines Multi-modal Retrieval Augmented Multi-modal Generation (M$^2$RAG), a task that enables foundation models to process mixed text-and-image web content and generate integrated multi-modal responses. It introduces a benchmark and a suite of text-modal and multi-modal evaluation metrics, plus two generation strategies (single-stage and multi-stage) and a high-quality training dataset built by filtering GPT-4o outputs with the proposed metrics. Across 12 models, results show that large language models (LLMs) generally outperform open-source multi-modal LLMs (MLLMs), with multi-stage and joint modeling yielding the strongest performance; fine-tuned 7B-8B models can surpass GPT-4o on the benchmark, highlighting the value of data curation. The study validates metric reliability, analyzes cross-domain and topic effects, and demonstrates the benefits of auxiliary images, providing resources to spur future advances in multi-modal retrieval-augmented generation. The formulation $r = M_G(Q, K_{In-Doc})$, with its input $K_{In-Doc}$, is central to the pipeline, linking retrieval, generation, and multi-modal synthesis in a cohesive framework.

Abstract

We present a systematic investigation of Multi-modal Retrieval Augmented Multi-modal Generation (M$^2$RAG), a novel task that enables foundation models to process multi-modal web content and generate multi-modal responses, which exhibit better information density and readability. Despite its potential impact, M$^2$RAG remains understudied, lacking comprehensive analysis and high-quality data resources. To address this gap, we establish a comprehensive benchmark through a rigorous data curation pipeline, and employ text-modal metrics and multi-modal metrics based on foundation models for evaluation. We further propose several strategies for foundation models to handle the M$^2$RAG task effectively and construct a training set by filtering high-quality samples using our designed metrics. Our extensive experiments demonstrate the reliability of our proposed metrics, map the landscape of model performance under our designed strategies, and show that our fine-tuned 7B-8B models outperform the GPT-4o model and approach the state-of-the-art OpenAI o3-mini. Additionally, we perform fine-grained analyses across diverse domains and validate the effectiveness of our designs in the data curation pipeline. All resources, including code, datasets, and model weights, will be publicly released.

Paper Structure

This paper contains 60 sections, 2 equations, 23 figures, 9 tables.

Figures (23)

  • Figure 1: A typical comparison between naive RAG (upper) and our proposed M$^2$RAG (lower). The generative model is GPT-4o in this case.
  • Figure 2: The framework of our proposed dataset construction and M$^2$RAG pipeline. Steps 1-3 represent the data curation pipeline and Step 4 demonstrates our proposed generation strategies.
  • Figure 3: Average overall score across 10 topics.
  • Figure 4: Prompt template for filtering queries which are not complex questions.
  • Figure 5: Prompt template for filtering queries which are not necessarily answered with images.
  • ...and 18 more figures