Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs
Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
TL;DR
This work addresses the knowledge limitations of multimodal LLMs by introducing Wiki-LLaVA, which augments a base MLLM with an external multimodal knowledge base accessed through a hierarchical retrieval pipeline. A two-stage retrieval first selects relevant documents by image-to-title similarity, then extracts the most pertinent passages from those documents to form contextual prompts for the LLM. The approach combines a CLIP-based visual encoder, a Contriever-based passage retriever, and LoRA fine-tuning to integrate retrieved content without major architectural changes, and is evaluated on Encyclopedic-VQA and InfoSeek. Experimental results show that retrieved external knowledge substantially improves accuracy on knowledge-intensive VQA tasks, with performance benefiting from multiple retrieved passages and, in favorable settings, approaching that obtained with oracle guidance. This work demonstrates the viability of retrieval-augmented MLLMs and lays groundwork for more flexible, domain-adaptive multimodal reasoning systems.
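The two-stage retrieval described above can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings are hand-written toy vectors standing in for CLIP and Contriever outputs, the function names (`top_k`, `hierarchical_retrieve`, `build_prompt`) and the prompt layout are hypothetical, and using the question embedding as the stage-two query is an assumption made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, candidates, k):
    """Indices of the k candidates most similar to the query."""
    order = sorted(range(len(candidates)),
                   key=lambda i: cosine(query, candidates[i]),
                   reverse=True)
    return order[:k]

def hierarchical_retrieve(image_emb, question_emb, docs, k_docs=1, n_passages=2):
    # Stage 1: compare the image embedding with document title embeddings.
    doc_ids = top_k(image_emb, [d["title_emb"] for d in docs], k_docs)
    # Stage 2 (assumed query: the question embedding): rank passages
    # only within the documents selected in stage 1.
    pool = [(p["text"], p["emb"]) for i in doc_ids for p in docs[i]["passages"]]
    best = top_k(question_emb, [emb for _, emb in pool], n_passages)
    return [pool[i][0] for i in best]

def build_prompt(passages, question):
    # Hypothetical prompt layout: retrieved passages as context, then the question.
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Toy knowledge base; all vectors and texts are placeholders.
docs = [
    {"title_emb": [1.0, 0.0],
     "passages": [{"text": "doc0-p0", "emb": [1.0, 0.0, 0.0]},
                  {"text": "doc0-p1", "emb": [0.0, 1.0, 0.0]},
                  {"text": "doc0-p2", "emb": [0.0, 0.0, 1.0]}]},
    {"title_emb": [0.0, 1.0],
     "passages": [{"text": "doc1-p0", "emb": [0.0, 1.0, 0.0]}]},
]
image_emb = [0.9, 0.1]          # stands in for a visual-encoder embedding
question_emb = [0.1, 0.9, 0.2]  # stands in for a text-retriever embedding
question = "hypothetical question"

selected = hierarchical_retrieve(image_emb, question_emb, docs)
prompt = build_prompt(selected, question)
print(prompt)
```

Restricting the second stage to passages from the selected documents keeps the passage search small, at the cost of depending on stage one finding the right document.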
Abstract
Multimodal LLMs are the natural evolution of LLMs, extending their capabilities beyond the purely textual modality. While ongoing research focuses on designing novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, integrates an external knowledge source of multimodal documents, which is accessed through a hierarchical retrieval pipeline. With this approach, relevant passages are retrieved from the external knowledge source and employed as additional context for the LLM, improving the effectiveness and precision of the generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the suitability of our approach.
