UniCoRN: Unified Commented Retrieval Network with LMMs

Maximilian Jaritz, Matthieu Guillaumin, Sabine Sternig, Loris Bazzani

TL;DR

UniCoRN combines complex visio-linguistic reasoning with retrieval by freezing a base Large Multimodal Model (LMM) and introducing two interconnected modules: Comment-aware Retrieval and Retrieval-aware Generation. A trainable Entity Adapter injects retrieved multimodal content into the LMM's input stream, so the generated comments are coherent and grounded in the retrieved answers they accompany. The paper introduces the Commented Retrieval (CoR) task and two datasets, CIRR-CoR and Wiki-CoR, and reports improvements of +4.5% recall over the state of the art for composed multimodal retrieval and of +14.9% METEOR / +18.4% BEM over RAG for commenting. Because the base LMM stays frozen, the approach preserves its original capabilities while adding retrieval and commenting functionality, offering a practical pathway toward integrated multimodal search and explanatory QA.

Abstract

Multimodal retrieval methods have limitations in handling complex, compositional queries that require reasoning about the visual content of both the query and the retrieved entities. On the other hand, Large Multimodal Models (LMMs) can answer more complex visual questions in natural language, but lack the inherent ability to retrieve relevant entities to support their answers. We aim to address these limitations with UniCoRN, a Unified Commented Retrieval Network that combines the strengths of composed multimodal retrieval methods and generative language approaches, going beyond Retrieval-Augmented Generation (RAG). We introduce an entity adapter module to inject the retrieved multimodal entities back into the LMM, so it can attend to them while generating answers and comments. By keeping the base LMM frozen, UniCoRN preserves its original capabilities while being able to perform both retrieval and text generation tasks under a single integrated framework. To assess these new abilities, we introduce the Commented Retrieval task (CoR) and a corresponding dataset, with the goal of retrieving an image that accurately answers a given question and generating an additional textual response that provides further clarification and details about the visual information. We demonstrate the effectiveness of UniCoRN on several datasets, showing improvements of +4.5% recall over the state of the art for composed multimodal retrieval and of +14.9% METEOR / +18.4% BEM over RAG for commenting in CoR.

Paper Structure

This paper contains 21 sections, 8 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Commented retrieval. Given a query image and question, UniCoRN can retrieve an image from a database and produce a textual answer that offers further clarification and details.
  • Figure 2: Comment-aware Retrieval. The query is fed to both a CLIP-trained image-text encoder and an LMM. The LMM representation is projected into the space of the image-text encoder. Query and targets are aligned using a contrastive loss.
  • Figure 3: Retrieval-aware Generation. The query image and text are fed to the LMM, which asks the retriever for relevant entities. The best entity is provided to the user and adapted into the LMM, so it can attend to it for generating a useful comment.
  • Figure 4: Qualitative retrieval results. We show retrieved images for UniIR and UniCoRN on three datasets. Captions are not displayed because of space limits.
  • Figure 5: Qualitative results. We show retrieved images and comments for UniIR with RAG and UniCoRN on two different datasets. Comments highlighted in red indicate responses that either do not answer the original question or are not related to the retrieved image.
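
The alignment step described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the additive fusion in `fuse_query`, the single linear projection `proj`, the temperature of 0.07, and all array shapes are assumptions made for illustration, and the contrastive loss is written as a standard one-directional InfoNCE over in-batch negatives, which may differ from the loss the authors use.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def fuse_query(clip_query_emb, lmm_hidden, proj):
    """Project the LMM representation into the image-text encoder space.

    Assumption: the projected LMM feature is simply added to the CLIP query
    embedding. The caption only states that a projection takes place.
    """
    return clip_query_emb + lmm_hidden @ proj

def info_nce(query_emb, target_emb, temperature=0.07):
    """One-directional InfoNCE: row i of the queries matches row i of the targets."""
    q = l2_normalize(query_emb)
    t = l2_normalize(target_emb)
    logits = q @ t.T / temperature                      # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))          # matching pairs lie on the diagonal

# Toy data standing in for encoder outputs (all shapes are hypothetical).
rng = np.random.default_rng(0)
batch, d_clip, d_lmm = 4, 8, 16
clip_q = rng.normal(size=(batch, d_clip))           # image-text encoder query embeddings
lmm_h = rng.normal(size=(batch, d_lmm))             # frozen-LMM query representations
proj = rng.normal(size=(d_lmm, d_clip)) * 0.01      # trainable projection (random here)
targets = rng.normal(size=(batch, d_clip))          # target entity embeddings

loss = info_nce(fuse_query(clip_q, lmm_h, proj), targets)
print(f"contrastive loss on random data: {loss:.4f}")
```

In a training loop, only `proj` (and any other adapter parameters) would receive gradients, which matches the caption's setup of a frozen LMM whose representation is mapped into the retriever's embedding space.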