MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, Wei Ping

TL;DR

This work tackles universal multimodal retrieval by fine-tuning multimodal LLM (MLLM) based bi-encoder retrievers to handle text, image, and interleaved text-image queries across diverse tasks. Its main contributions are modality-aware hard negative mining to mitigate modality bias, continual fine-tuning that strengthens text retrieval while preserving multimodal retrieval, and zero-shot reranking by prompting off-the-shelf MLLMs. MM-Embed achieves state-of-the-art performance on the M-BEIR multimodal benchmark and surpasses NV-Embed-v1 on the MTEB text retrieval benchmark, while zero-shot reranking further improves challenging multimodal tasks such as CIRCO. The findings suggest potential for distillation and for iterative improvement of multimodal retrieval in real-world search systems.

Abstract

State-of-the-art retrieval models typically address a straightforward search scenario, in which retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single modality is supported for both queries and retrieved results. This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs), enabling a broader search scenario, termed universal multimodal retrieval, where multiple modalities and diverse retrieval tasks are accommodated. To this end, we first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. Our empirical results show that the fine-tuned MLLM retriever is capable of understanding challenging queries, composed of both text and image, but it underperforms compared to a smaller CLIP retriever in cross-modal retrieval tasks due to the modality bias exhibited by MLLMs. To address the issue, we propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers. Second, we propose continuously fine-tuning the universal multimodal retriever to enhance its text retrieval capability while preserving multimodal retrieval capability. As a result, our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR, which spans multiple domains and tasks, while also surpassing the state-of-the-art text retrieval model, NV-Embed-v1, on the MTEB retrieval benchmark. We also explore prompting the off-the-shelf MLLMs as zero-shot rerankers to refine the ranking of the candidates from the multimodal retriever. We find that, through prompt-and-reranking, MLLMs can further improve multimodal retrieval when the user queries (e.g., text-image composed queries) are more complex and challenging to understand. These findings also pave the way for advancing universal multimodal retrieval in the future.
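The abstract names modality-aware hard negative mining without giving the selection rule, so the following is only a minimal sketch of one plausible reading. It assumes a bi-encoder that scores candidates by dot product, and it collects two kinds of negatives per query: candidates of the wrong modality that outrank the positive, and top-ranked non-positive candidates of the requested modality. The function name, the parameter `k`, and the toy embeddings are all hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity between two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def mine_modality_aware_negatives(
    query_emb: Sequence[float],
    cand_embs: Sequence[Sequence[float]],
    cand_modalities: Sequence[str],
    target_modality: str,
    positive_idx: int,
    k: int = 2,
) -> Tuple[List[int], List[int]]:
    """Return (wrong-modality negatives, right-modality negatives).

    Assumed rule (illustrative, not the paper's exact procedure):
      * wrong-modality negatives: candidates ranked above the positive
        whose modality differs from the one the task asks for;
      * right-modality negatives: highest-ranked non-positive candidates
        that do have the requested modality.
    """
    scores = [dot(query_emb, c) for c in cand_embs]
    # Candidate indices sorted from most to least similar to the query.
    ranking = sorted(range(len(cand_embs)), key=lambda i: scores[i], reverse=True)
    pos_rank = ranking.index(positive_idx)

    wrong_modality = [
        i for i in ranking[:pos_rank] if cand_modalities[i] != target_modality
    ][:k]
    right_modality = [
        i
        for i in ranking
        if i != positive_idx and cand_modalities[i] == target_modality
    ][:k]
    return wrong_modality, right_modality


# Toy example: the task asks for an image, but a text candidate ranks first.
query = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.5, 0.5]]
modalities = ["text", "image", "image", "text", "image"]
wrong, right = mine_modality_aware_negatives(
    query, candidates, modalities, target_modality="image", positive_idx=1, k=1
)
print(wrong, right)
```

In this toy setup the text candidate at index 0 outranks the positive image, which is the kind of modality bias the mined negatives are meant to penalize during further fine-tuning.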

Paper Structure

This paper contains 25 sections, 1 equation, 1 figure, 7 tables.

Figures (1)

  • Figure 1: Illustration of universal multimodal retrieval in (a), where each task consists of a task-specific instruction and query. Both queries and candidate documents are in heterogeneous formats (i.e., text, image, or interleaved text-image). In this work, we explore (a) fine-tuning MLLM-based universal multimodal retrievers and (b) prompting pre-trained MLLMs for zero-shot reranking over retrieved candidates. We adopt LLaVa-Next as our MLLM backbone.
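
Step (b) in the caption, zero-shot reranking, can be illustrated with a small sketch. It assumes the MLLM is prompted with a yes/no style relevance question for each query-candidate pair and that the logits of the two answer tokens are available; the relevance score is then taken as the softmax probability of the positive answer. This scoring rule, the function names, and the example logits are assumptions made for illustration and are not confirmed by the text above.

```python
import math
from typing import List, Sequence, Tuple


def true_probability(true_logit: float, false_logit: float) -> float:
    """Softmax probability of the positive answer over the two answer tokens."""
    m = max(true_logit, false_logit)  # subtract the max for numerical stability
    e_true = math.exp(true_logit - m)
    e_false = math.exp(false_logit - m)
    return e_true / (e_true + e_false)


def rerank(
    candidate_ids: Sequence[str],
    logit_pairs: Sequence[Tuple[float, float]],
) -> List[str]:
    """Reorder retrieved candidates by the assumed MLLM relevance score.

    logit_pairs[i] holds the (positive, negative) answer-token logits that a
    hypothetical MLLM call produced for candidate_ids[i].
    """
    scored = [
        (true_probability(t, f), cid)
        for cid, (t, f) in zip(candidate_ids, logit_pairs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored]


# Candidates in retriever order, with made-up logits standing in for MLLM output.
retrieved = ["doc_a", "doc_b", "doc_c"]
logits = [(0.0, 1.0), (2.0, 0.0), (1.0, 1.0)]
print(rerank(retrieved, logits))
```

Because only the top candidates from the retriever are rescored, a pipeline of this shape keeps the expensive MLLM calls bounded by the size of the candidate list.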