Multimodal Iterative RAG for Knowledge-Intensive Visual Question Answering

Changin Choi, Wonseok Lee, Jungmin Ko, Wonjong Rhee

TL;DR

PMSR (Progressive Multimodal Search and Reasoning) is a framework for knowledge-intensive VQA that builds a trajectory of compact reasoning records and uses dual-scope queries to iteratively retrieve diverse knowledge from textual and multimodal KBs. By condensing evidence into reusable reasoning records and terminating adaptively once information saturates, PMSR reasons more stably and drifts less than static RAG and agent-based methods. Experiments across six benchmarks show consistent gains in retrieval recall and end-to-end accuracy, with stronger backbones yielding larger improvements and cross-domain robustness demonstrated on OK-VQA, FVQA, InfoSeek, and Encyclopedic-VQA (E-VQA). The work highlights the value of structured reasoning trajectories for guiding retrieval and synthesis in multimodal QA, and acknowledges computational overhead and sensitivity to retriever quality as areas for future improvement.

Abstract

Knowledge-intensive visual question answering (VQA) requires external knowledge beyond image content, demanding precise visual grounding and coherent integration of visual and textual information. Although multimodal retrieval-augmented generation has achieved notable advances by incorporating external knowledge bases, existing approaches largely adopt single-pass frameworks that often fail to acquire sufficient knowledge and lack mechanisms to revise misdirected reasoning. We propose PMSR (Progressive Multimodal Search and Reasoning), a framework that progressively constructs a structured reasoning trajectory to enhance both knowledge acquisition and synthesis. PMSR uses dual-scope queries conditioned on both the latest record and the trajectory to retrieve diverse knowledge from heterogeneous knowledge bases. The retrieved evidence is then synthesized into compact records via compositional reasoning. This design facilitates controlled iterative refinement, which supports more stable reasoning trajectories with reduced error propagation. Extensive experiments across six diverse benchmarks (Encyclopedic-VQA, InfoSeek, MMSearch, LiveVQA, FVQA, and OK-VQA) demonstrate that PMSR consistently improves both retrieval recall and end-to-end answer accuracy.

Paper Structure

This paper contains 28 sections, 10 equations, 1 figure, 9 tables.

Figures (1)

  • Figure 1: Overview of PMSR with the reasoning trajectory update loop at iteration $t$. PMSR consists of three stages: initial record generation, iterative reasoning trajectory updates, and adaptive termination. At each iteration, the reasoning trajectory update loop generates dual-scope queries conditioned on the latest reasoning record and the trajectory, retrieves knowledge from heterogeneous textual and multimodal KBs, and synthesizes the retrieved candidates into a new reasoning record. The newly generated record is appended to the trajectory to guide subsequent iterations. The process terminates adaptively when further iterations provide limited additional evidence.
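
The loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every function name, the toy keyword-matching knowledge bases, and the saturation rule (stop when an iteration retrieves fewer than `min_new` unseen passages) are assumptions introduced here. In practice the record generation, query generation, and synthesis steps would be model calls, which are replaced by simple string operations below.

```python
from typing import Callable, Dict, List, Sequence

Retriever = Callable[[str], List[str]]


def progressive_search(
    question: str,
    make_initial_record: Callable[[str], str],
    make_queries: Callable[[str, Sequence[str]], List[str]],
    retrievers: Sequence[Retriever],
    synthesize: Callable[[str, Sequence[str], Sequence[str]], str],
    max_iters: int = 5,
    min_new: int = 1,
) -> List[str]:
    """Hypothetical sketch of an iterative reasoning-trajectory loop."""
    # Stage 1: initial record generation.
    trajectory = [make_initial_record(question)]
    seen = set()
    # Stage 2: iterative trajectory updates.
    for _ in range(max_iters):
        # Dual-scope queries: conditioned on the latest record and the trajectory.
        queries = make_queries(trajectory[-1], trajectory)
        # Retrieve from every (heterogeneous) knowledge base for every query.
        candidates = [p for q in queries for r in retrievers for p in r(q)]
        # Deduplicate while preserving order, then keep only unseen evidence.
        new = [p for p in dict.fromkeys(candidates) if p not in seen]
        # Stage 3: adaptive termination (assumed rule: too little new evidence).
        if len(new) < min_new:
            break
        seen.update(new)
        trajectory.append(synthesize(question, trajectory, new))
    return trajectory


def keyword_retriever(kb: Dict[str, List[str]]) -> Retriever:
    """Toy retriever: returns passages whose key occurs in the query."""
    def retrieve(query: str) -> List[str]:
        q = query.lower()
        return [p for key, passages in kb.items() if key in q for p in passages]
    return retrieve


# Toy stand-ins for a textual KB and a multimodal KB (made-up contents).
TEXT_KB = {
    "tower": ["The tower is in Paris."],
    "paris": ["Paris is the capital of France."],
}
MM_KB = {"tower": ["[image] Eiffel Tower at night"]}

question = "Which country is this tower in?"
trajectory = progressive_search(
    question,
    make_initial_record=lambda q: q,
    make_queries=lambda latest, traj: [latest, " ".join(traj)],
    retrievers=[keyword_retriever(TEXT_KB), keyword_retriever(MM_KB)],
    synthesize=lambda q, traj, new: " ".join(new),
)
print(trajectory)
```

In this toy run the first iteration surfaces the Paris passage, the second follows it to the France passage, and the third finds nothing unseen, so the loop stops before `max_iters` is reached. Counting unseen passages is only one possible proxy for "limited additional evidence"; the caption does not specify the actual criterion.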