ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

Alberto Compagnoni, Marco Morini, Sara Sarto, Federico Cocchi, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

TL;DR

This work tackles knowledge-based visual question answering by integrating a two-stage retrieval pipeline with a critic that filters noisy passages and a reasoning-enabled generator trained via a cold-start supervised phase followed by reinforcement learning. The model produces explicit reasoning traces grounded in retrieved evidence, delivering improved answer accuracy on Encyclopedic-VQA and InfoSeek while offering interpretability. Key contributions include a multi-level retrieval strategy, a passage-relevance critic, and a GRPO-inspired RL framework that enhances multimodal reasoning over external knowledge. ReAG demonstrates state-of-the-art performance and explainability, with code released for reproducibility and broader adoption of knowledge-grounded multimodal reasoning.
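As a rough illustration of what "GRPO-inspired" usually implies, the sketch below computes group-relative advantages: several answers are sampled for the same question, each receives a scalar reward, and every reward is normalized against the mean and spread of its own group. This is a minimal sketch of the generic GRPO-style normalization, not the paper's actual objective. The function name, the `eps` constant, the use of the population standard deviation, and the toy rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each sampled answer's reward against its own group.

    Hypothetical helper: answers that score above the group mean get a
    positive advantage, answers below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, not taken from the paper
    # eps keeps the division defined when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of four sampled answers with binary accuracy rewards (made-up values)
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Because the baseline comes from the group itself, no separate value model is needed; a group in which every answer earns the same reward yields zero advantage for all of its members.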

Abstract

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Our source code is publicly available at: https://github.com/aimagelab/ReAG.
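The control flow described in the abstract (coarse retrieval, fine-grained retrieval, critic filtering, then generation) can be sketched as follows. This is a hedged outline only: every name here (`Passage`, `filter_passages`, `answer`, the `toy_*` stand-ins, the `threshold` value, and the toy corpus) is hypothetical and does not come from the ReAG codebase. The real components are learned models, which are replaced by plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    doc_id: str
    text: str


def filter_passages(image, question, passages: List[Passage],
                    critic: Callable, threshold: float = 0.5) -> List[Passage]:
    """Keep only passages the critic scores at or above the threshold."""
    return [p for p in passages if critic(image, question, p) >= threshold]


def answer(image, question, coarse_retrieve: Callable, fine_retrieve: Callable,
           critic: Callable, generate: Callable, top_k: int = 5) -> str:
    doc_ids = coarse_retrieve(image, question, top_k)       # coarse: candidate documents
    candidates = fine_retrieve(image, question, doc_ids)    # fine: passages inside them
    kept = filter_passages(image, question, candidates, critic)  # critic drops noise
    return generate(image, question, kept)                  # generator sees filtered context


# Toy stand-ins so the sketch runs end to end (all assumptions)
corpus = {
    "doc_a": ["The tower is the subject of this passage.", "An unrelated passage."],
    "doc_b": ["A passage about a bridge."],
}


def toy_coarse(image, question, top_k):
    return list(corpus)[:top_k]


def toy_fine(image, question, doc_ids):
    return [Passage(d, t) for d in doc_ids for t in corpus[d]]


def toy_critic(image, question, passage):
    return 1.0 if "tower" in passage.text.lower() else 0.0


def toy_generate(image, question, passages):
    return " ".join(p.text for p in passages) or "no evidence"


result = answer("image.jpg", "What is shown in the picture?",
                toy_coarse, toy_fine, toy_critic, toy_generate)
```

Keeping the critic as a separate step between retrieval and generation means the number of passages reaching the generator can vary per query, which is the quantity the right panel of Figure 3 reports for different top-k values.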

Paper Structure

This paper contains 23 sections, 11 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Comparison between Zero-Shot (ZS) MLLMs, retrieval-augmented models, and ReAG. ZS MLLMs lack specialized knowledge and fail on domain-specific queries (top). Retrieval-augmented models introduce external context but often add noisy or irrelevant passages (middle). ReAG overcomes this with a filtering stage over retrieved content and a multi-stage training strategy to enhance reasoning over passages.
  • Figure 2: Overview of the proposed ReAG model. A multi-level retriever module extracts noisy passages, which are refined by a critic model. The resulting relevant passages are fed to a generator trained via SFT and a reinforcement learning stage designed for the KB-VQA task.
  • Figure 3: Comparison of performance on E-VQA with and without evidence, including oracle upper bounds (left). Analysis of the average number of passages retained at different top‑$k$ values (right).
  • Figure 4: Qualitative results on InfoSeek image-question pairs comparing ReAG, ReflectiVA [cocchi2025augmenting], and the corresponding zero-shot model.
  • Figure 5: Task‑specific accuracy reward progression across training iterations of the ReAG 7B generator.
  • ...and 5 more figures