A Survey of Multimodal Retrieval-Augmented Generation

Lang Mei, Siyu Mo, Zhihan Yang, Chong Chen

TL;DR

Multimodal Retrieval-Augmented Generation (MRAG) extends traditional RAG by grounding LLMs in multimodal sources (text, images, videos) to reduce hallucinations and improve factual accuracy. The paper traces MRAG's evolution from MRAG1.0 (pseudo-MRAG) to MRAG3.0 (true multimodality), detailing core components (document parsing, multimodal retrieval, and generation) along with newer modules such as multimodal search planning and retrieval refinement. It catalogs multimodal datasets spanning combined retrieval-and-generation benchmarks, generation-specific tasks, and multidisciplinary domains, and reviews evaluation metrics that combine rule-based and LLM/MLLM-based approaches. The survey also discusses key challenges (data accuracy, planning adaptivity, cross-modal retrieval, and comprehensive evaluation) and proposes future directions for parsing, search planning, retrieval, generation, and benchmarks.

Abstract

Multimodal Retrieval-Augmented Generation (MRAG) enhances large language models (LLMs) by integrating multimodal data (text, images, videos) into retrieval and generation processes, overcoming the limitations of text-only Retrieval-Augmented Generation (RAG). While RAG improves response accuracy by incorporating external textual knowledge, MRAG extends this framework to include multimodal retrieval and generation, leveraging contextual information from diverse data types. This approach reduces hallucinations and enhances question-answering systems by grounding responses in factual, multimodal knowledge. Recent studies show MRAG outperforms traditional RAG, especially in scenarios requiring both visual and textual understanding. This survey reviews MRAG's essential components, datasets, evaluation methods, and limitations, providing insights into its construction and improvement. It also identifies challenges and future research directions, highlighting MRAG's potential to revolutionize multimodal information retrieval and generation. By offering a comprehensive perspective, this work encourages further exploration into this promising paradigm.

Paper Structure

This paper contains 48 sections, 7 figures, and 1 table.

Figures (7)

  • Figure 3: MRAG3.0 architecture integrates document screenshots during the document parsing and indexing stages to minimize information loss. At the input stage, it incorporates a Multimodal Search Planning module, unifying Visual Question Answering (VQA) and Retrieval-Augmented Generation (RAG) tasks while refining user query precision. At the output stage, the Multimodal Retrieval-Augmented Composition module enhances answer generation by transforming plain text into multimodal formats, thereby enriching information delivery.
  • Figure 5: Multimodal output in QA scenarios can be categorized into three distinct types. In sub-scenario I, the user's query can be fully addressed using only images or videos, without requiring supplementary textual information. Sub-scenario II involves a step-by-step explanation that combines text and images to ensure clarity and precision; omitting the images may lead to user confusion at specific steps. In sub-scenario III, supplementary images enrich the information conveyed in the answer, but their removal does not compromise the answer's accuracy.
  • Figure 6: Taxonomy of recent advancements in multimodal retrieval research.
  • Figure 7: Retriever architectures in multimodal retrieval.
  • Figure 8: Taxonomy of recent advancements in multimodal generation research.
  • ...and 2 more figures