Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs

Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

TL;DR

This work addresses the knowledge limitations of multimodal LLMs (MLLMs) by introducing Wiki-LLaVA, which augments a base MLLM with an external multimodal knowledge base accessed through a hierarchical retrieval pipeline. A two-stage retrieval first selects relevant documents by image-to-title similarity, then extracts the most pertinent passages from those documents to form contextual prompts for the LLM. The approach combines a CLIP-based visual encoder, a Contriever-based passage retriever, and LoRA fine-tuning to integrate retrieved content without major architectural changes, and is evaluated on Encyclopedic-VQA and InfoSeek. Experimental results show that retrieved external knowledge substantially improves accuracy on knowledge-intensive VQA tasks, with performance benefiting from multiple retrieved passages and approaching oracle-level guidance in favorable settings. This work demonstrates the viability of retrieval-augmented MLLMs and lays the groundwork for more flexible, domain-adaptive multimodal reasoning systems.
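The two-stage retrieval described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 2-dimensional embeddings, and the choice to keep the top `n_passages` passages per retrieved document are all assumptions made for illustration. In the real system the image and title embeddings would come from a CLIP-based encoder and the question and passage embeddings from a Contriever-based retriever; here they are hand-written vectors.

```python
import numpy as np


def normalize(x):
    """L2-normalize along the last axis so dot products equal cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def hierarchical_retrieve(image_emb, title_embs, question_emb, passage_embs,
                          k_docs=1, n_passages=2):
    """Hypothetical two-stage retrieval.

    image_emb:    (d,) embedding of the query image
    title_embs:   (num_docs, d) embeddings of document titles (same space as image_emb)
    question_emb: (p,) embedding of the question
    passage_embs: list with one (num_passages_i, p) array per document
    Returns a list of (doc_index, passage_index, score) tuples.
    """
    # Stage 1: rank documents by image-to-title cosine similarity.
    title_scores = normalize(title_embs) @ normalize(image_emb)
    top_docs = np.argsort(-title_scores)[:k_docs]

    # Stage 2: inside each retrieved document, rank passages against the question.
    results = []
    for d in top_docs:
        scores = normalize(passage_embs[d]) @ normalize(question_emb)
        for p in np.argsort(-scores)[:n_passages]:
            results.append((int(d), int(p), float(scores[p])))
    return results


# Toy data (assumed values, for illustration only).
title_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
image_emb = np.array([0.1, 0.9])
passage_embs = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]),
    np.array([[1.0, 1.0]]),
]
question_emb = np.array([0.0, 1.0])

hits = hierarchical_retrieve(image_emb, title_embs, question_emb, passage_embs,
                             k_docs=1, n_passages=2)
for doc_idx, passage_idx, score in hits:
    print(doc_idx, passage_idx, round(score, 3))
```

Restricting the passage search to the documents selected in the first stage keeps the second stage cheap, since only a handful of documents need passage-level scoring.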

Abstract

Multimodal LLMs are the natural evolution of LLMs, extending their capabilities beyond the purely textual modality. While ongoing research focuses on designing novel architectures and vision-and-language adapters, this paper concentrates on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, integrates an external knowledge source of multimodal documents, which is accessed through a hierarchical retrieval pipeline. With this approach, relevant passages are retrieved from the external knowledge source and employed as additional context for the LLM, improving the effectiveness and precision of generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the appropriateness of our approach.
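As a rough illustration of how retrieved passages might be employed as additional context, the sketch below assembles a text prompt from a question and a list of passages. The template, the `build_prompt` name, and the `<image>` placeholder are hypothetical; the prompt format actually used by Wiki-LLaVA is not given in this summary.

```python
def build_prompt(question, passages, image_token="<image>"):
    """Assemble a context-augmented prompt (hypothetical template).

    Empty or whitespace-only passages are skipped; with no usable passages
    the prompt falls back to the plain image-plus-question form.
    """
    context = "\n".join(f"- {p.strip()}" for p in passages if p.strip())
    parts = [image_token]
    if context:
        parts.append("Context:\n" + context)
    parts.append("Question: " + question.strip())
    parts.append("Answer:")
    return "\n".join(parts)


prompt = build_prompt(
    "Who designed this building?",
    ["Passage A.", "   ", "Passage B."],
)
print(prompt)
```

Falling back to the plain form when nothing is retrieved lets the same function serve questions that need no external knowledge.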

Paper Structure (12 sections, 3 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Comparison between a standard multimodal LLM and Wiki-LLaVA. Our model integrates knowledge retrieved from an external knowledge base of documents through a hierarchical retrieval pipeline. As a result, it provides more precise answers when tasked with questions that require external knowledge.
  • Figure 2: Overview of the architecture of Wiki-LLaVA, which augments a multimodal LLM with external knowledge through a hierarchical retrieval pipeline.
  • Figure 3: Qualitative results on sample image-question pairs from Encyclopedic-VQA (first row) and InfoSeek (second row), comparing the proposed approach with the original LLaVA-1.5 model. Some failure cases are shown in the third row with the corresponding ground truth.