ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
Alberto Compagnoni, Marco Morini, Sara Sarto, Federico Cocchi, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
TL;DR
This work tackles knowledge-based visual question answering (KB-VQA) with a multi-level retrieval pipeline: coarse- and fine-grained retrieval, then a passage-relevance critic that filters out noisy passages before they reach a reasoning-enabled generator. The generator is trained in stages, with supervised fine-tuning as a cold start followed by a GRPO-inspired reinforcement learning phase that strengthens multimodal reasoning over external knowledge. The model produces explicit reasoning traces grounded in the retrieved evidence, which makes its answers interpretable. On Encyclopedic-VQA and InfoSeek, ReAG outperforms prior methods in answer accuracy. The source code is publicly released for reproducibility.
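The summary above describes the reinforcement learning phase only as "GRPO-inspired". As a rough illustration of the core idea in standard GRPO, and not of ReAG's exact objective, the sketch below normalizes the rewards of a group of answers sampled for the same question into group-relative advantages. The function name `group_relative_advantages`, the `eps` constant, and the example rewards are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standard GRPO-style normalization: (reward - group mean) / group std.

    `rewards` holds one scalar reward per answer sampled for the same question.
    `eps` (an assumed constant) avoids division by zero when all rewards are equal.
    """
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # population standard deviation over the group
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# Hypothetical rewards for four sampled answers (1.0 = correct, 0.0 = wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Answers that score above the group average receive a positive advantage and are reinforced, while below-average answers are discouraged, so no separate learned value model is needed to provide the baseline.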
Abstract
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Our source code is publicly available at: https://github.com/aimagelab/ReAG.
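The retrieve-then-filter flow described in the abstract can be sketched as follows. This is a minimal, text-only illustration under stated assumptions, not the authors' implementation: ReAG operates on images and text with learned retrievers and a learned critic, whereas here a toy word-overlap score stands in for both. The names `Passage`, `word_overlap`, `retrieve_and_filter`, `k_docs`, `k_passages`, and the tiny knowledge base are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Passage:
    doc_id: str
    text: str


def word_overlap(query: str, text: str) -> float:
    """Toy relevance score: fraction of query words that also appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / max(len(query_words), 1)


def retrieve_and_filter(
    query: str,
    knowledge_base: Dict[str, List[str]],
    scorer: Callable[[str, str], float],
    critic: Callable[[str, str], bool],
    k_docs: int = 2,
    k_passages: int = 3,
) -> List[Passage]:
    # Coarse stage: rank whole documents and keep the top k_docs.
    doc_ids = sorted(
        knowledge_base,
        key=lambda d: scorer(query, " ".join(knowledge_base[d])),
        reverse=True,
    )[:k_docs]

    # Fine stage: rank individual passages from the surviving documents.
    candidates = [Passage(d, p) for d in doc_ids for p in knowledge_base[d]]
    candidates.sort(key=lambda p: scorer(query, p.text), reverse=True)
    candidates = candidates[:k_passages]

    # Critic stage: drop passages judged irrelevant to the query.
    return [p for p in candidates if critic(query, p.text)]


# Hypothetical two-document knowledge base.
kb = {
    "eiffel": ["The Eiffel Tower is in Paris.", "It was completed in 1889."],
    "louvre": ["The Louvre is a museum in Paris."],
}
query = "when was the eiffel tower completed"
kept = retrieve_and_filter(
    query, kb, word_overlap, lambda q, t: word_overlap(q, t) >= 0.3, k_docs=1
)
for passage in kept:
    print(passage.doc_id, "|", passage.text)
```

Only the passages that survive the critic would be passed to the generator as additional context, so the critic trades some recall for a cleaner prompt.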
