Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines

Zi-Ao Ma; Tian Lan; Rong-Cheng Tu; Yong Hu; Yu-Shi Zhu; Tong Zhang; Heyan Huang; Zhijing Wu; Xian-Ling Mao

TL;DR

This work defines Multi-modal Retrieval Augmented Multi-modal Generation (M$^2$RAG), a task in which foundation models process mixed text-and-image web content and generate integrated multi-modal responses. It introduces a benchmark, a suite of text-modal and multi-modal evaluation metrics, two generation strategies (single-stage and multi-stage), and a high-quality training dataset built by filtering GPT-4o outputs with the proposed metrics. Across 12 models, large language models (LLMs) generally outperform open-source multi-modal LLMs (MLLMs), and multi-stage and joint modeling yield the strongest performance. Fine-tuned 7B-8B models can surpass GPT-4o on the benchmark, which highlights the value of data curation. The study also validates metric reliability, analyzes cross-domain and topic effects, and demonstrates the benefits of auxiliary images, providing resources for future work on multi-modal retrieval-augmented generation. The formulation $r = M_G(Q, K_{In-Doc})$, together with its input $K_{In-Doc}$, is central to the pipeline: it links retrieval, generation, and multi-modal synthesis in a single framework.

Abstract

We present a systematic investigation of Multi-modal Retrieval Augmented Multi-modal Generation (M$^2$RAG), a novel task that enables foundation models to process multi-modal web content and generate multi-modal responses, which exhibit better information density and readability. Despite its potential impact, M$^2$RAG remains understudied, lacking comprehensive analysis and high-quality data resources. To address this gap, we establish a comprehensive benchmark through a rigorous data curation pipeline, and employ text-modal and multi-modal metrics based on foundation models for evaluation. We further propose several strategies that allow foundation models to handle the M$^2$RAG task effectively, and we construct a training set by filtering high-quality samples using our designed metrics. Our extensive experiments demonstrate the reliability of the proposed metrics, map out model performance under our designed strategies, and show that our fine-tuned 7B-8B models outperform GPT-4o and approach the state-of-the-art OpenAI o3-mini. Additionally, we perform fine-grained analyses across diverse domains and validate the effectiveness of the design choices in our data curation pipeline. All resources, including code, datasets, and model weights, will be publicly released.
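The metric-based filtering used to build the training set can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the `Sample` type, the metric names, the 0-to-1 score scale, and the threshold value are hypothetical, and the abstract does not say how scores are actually combined.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """A candidate training sample with per-metric scores (hypothetical layout)."""
    query: str
    response: str
    scores: dict = field(default_factory=dict)  # metric name -> score in [0, 1]


def filter_samples(samples, threshold=0.8):
    """Keep samples that have scores and whose every score is >= threshold.

    The "all metrics must pass" rule is an assumption; other aggregation
    rules (for example, a weighted average) would be equally plausible.
    """
    return [
        s for s in samples
        if s.scores and all(v >= threshold for v in s.scores.values())
    ]


# Hypothetical candidates scored by one text-modal and one multi-modal metric.
candidates = [
    Sample("q1", "response with a well-placed image",
           {"text_quality": 0.9, "image_relevance": 0.85}),
    Sample("q2", "response with an unrelated image",
           {"text_quality": 0.9, "image_relevance": 0.4}),
    Sample("q3", "unscored response"),
]

kept = filter_samples(candidates)
```

Under these assumptions only the first candidate is kept: the second fails one metric and the third has no scores at all.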
