mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA

Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Yuxuan Zhao, Zehua Xie, Jin Ma, Ying Shan, Weiming Hu

TL;DR

mR$^2$AG is a novel generalized framework that achieves adaptive retrieval and useful-information localization through two easy-to-implement reflection operations, without adding the model complexity of separate filtering modules or rules.

Abstract

Advanced Multimodal Large Language Models (MLLMs) struggle with recent Knowledge-based VQA tasks, such as INFOSEEK and Encyclopedic-VQA, because their knowledge scope is limited and frozen, which often leads to ambiguous and inaccurate responses. Multimodal Retrieval-Augmented Generation (mRAG) is therefore a natural way to provide MLLMs with comprehensive and up-to-date knowledge, effectively expanding their knowledge scope. However, current mRAG methods have inherent drawbacks: 1) they perform retrieval even when external knowledge is not needed; 2) they do not identify the evidence that supports the query; 3) they increase model complexity through additional information-filtering modules or rules. To address these shortcomings, we propose a novel generalized framework called multimodal Retrieval-Reflection-Augmented Generation (mR$^2$AG), which achieves adaptive retrieval and useful-information localization through two easy-to-implement reflection operations, avoiding high model complexity. In mR$^2$AG, Retrieval-Reflection is designed to distinguish different user queries and avoid redundant retrieval calls, and Relevance-Reflection is introduced to guide the MLLM in locating beneficial evidence in the retrieved content and generating answers accordingly. In addition, mR$^2$AG can be integrated into any well-trained MLLM with efficient fine-tuning on the proposed mR$^2$AG Instruction-Tuning dataset (mR$^2$AG-IT). mR$^2$AG significantly outperforms state-of-the-art MLLMs (e.g., GPT-4v/o) and RAG-based MLLMs on INFOSEEK and Encyclopedic-VQA, while maintaining the exceptional capabilities of base MLLMs across a wide range of Visual-dependent tasks.
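The control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the callables `needs_retrieval`, `retrieve`, `is_relevant`, and `generate` are hypothetical stand-ins for the fine-tuned MLLM and the retriever, and both the highest-score selection rule and the no-evidence fallback are assumptions, since the abstract does not specify how multiple candidate answers are combined.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Candidate:
    passage: str   # retrieved passage judged relevant
    answer: str    # answer generated from that passage
    score: float   # relevance score (hypothetical; used only for ranking)


def answer_query(
    image: Any,
    question: str,
    needs_retrieval: Callable[[Any, str], bool],
    retrieve: Callable[[Any, str], List[str]],
    is_relevant: Callable[[Any, str, str], Tuple[bool, float]],
    generate: Callable[[Any, str, Optional[str]], str],
) -> str:
    # Retrieval-Reflection: skip retrieval when the query does not need
    # external knowledge (e.g. a Visual-dependent question).
    if not needs_retrieval(image, question):
        return generate(image, question, None)

    candidates: List[Candidate] = []
    for passage in retrieve(image, question):
        # Relevance-Reflection: keep only passages judged to contain evidence.
        relevant, score = is_relevant(image, question, passage)
        if relevant:
            answer = generate(image, question, passage)
            candidates.append(Candidate(passage, answer, score))

    # Assumption: if no passage is judged relevant, answer without context.
    if not candidates:
        return generate(image, question, None)

    # Assumption: post-processing picks the answer with the highest score.
    return max(candidates, key=lambda c: c.score).answer
```

Passing the model behaviours in as callables keeps the sketch independent of any particular MLLM or retriever; in the framework described above, both reflection decisions are produced by the same fine-tuned model rather than by separate filtering modules.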

Paper Structure

This paper contains 27 sections, 3 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Comparisons of different methods on Visual-dependent and Knowledge-based VQA tasks: 1) Typical MLLMs use the image $I$ and question $Q$ as inputs, offering limited support for Knowledge-based questions. 2) Naive mRAG uses $I$, $Q$, and retrieved content $P_{1,2,3}$ as inputs in all cases, inevitably introducing irrelevant noise. 3) mR$^2$AG adaptively determines the necessity of retrieval and effectively locates the useful context, i.e., $P_3$ for $Q_2$.
  • Figure 2: Overview of the mR$^2$AG framework. (a1) mR$^2$AG w/ Retrieval: This process includes: a) Retrieval-Reflection for determining the necessity of retrieval; b) Relevance-Reflection for identifying evidence passages; c) Post-processing multiple potential answers. (a2) mR$^2$AG w/o Retrieval: The generation process when retrieval is unnecessary. (b) Naive mRAG: A baseline method without reflection.
  • Figure 3: Qualitative comparison of GPT-4o and mR$^2$AG on the INFOSEEK dataset. Two failure cases are shown in (c) and (d).
  • Figure 4: Qualitative results showing the effectiveness of the mR$^2$AG framework. The first row shows results from the INFOSEEK dataset, while the second row shows results from Enc-VQA.
  • Figure 5: Additional visualization results are provided: the first row shows examples from INFOSEEK; the second row shows examples from Enc-VQA, covering single-hop and multi-answer questions. The last column presents incorrect answers.