U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs

Xiaojie Li, Chu Li, Shi-Zhe Chen, Xi Chen

TL;DR

This work addresses universal multimodal retrieval (UMR) by systematically dissecting design choices for embedding learning with Multimodal Large Language Models (MLLMs). It studies three facets (embedding generation, InfoNCE-based training, and distillation) and combines the findings into the U-MARVEL framework, which achieves state-of-the-art performance on the M-BEIR benchmark in supervised settings and shows strong zero-shot generalization to tasks such as text-to-video and composed image retrieval. Key contributions include identifying that bidirectional attention with mean pooling and instructional masking improve embeddings, demonstrating the importance of progressive transition and a learnable temperature in contrastive learning, and presenting an efficient distillation approach that merges the benefits of a recall-then-rerank pipeline into a single model. The framework's accuracy across diverse tasks and its single-model efficiency make it practical for deploying robust, scalable UMR systems, with potential extensions to additional modalities and RAG-style applications.
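The two ingredients named above, mean pooling under a mask and InfoNCE with a temperature, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`masked_mean_pool`, `info_nce`), the use of plain lists instead of tensors, and the choice to parameterize the temperature as `exp(log_temperature)` are all assumptions made for illustration. In real training, `log_temperature` would be a learnable parameter updated by the optimizer, and the mask would zero out instruction and padding tokens.

```python
import math

def masked_mean_pool(hidden, keep_mask):
    """Average the token vectors whose mask entry is truthy.

    hidden: list of per-token vectors (lists of floats).
    keep_mask: 1 for tokens to keep, 0 for tokens to drop
    (assumed here to mark instruction or padding tokens).
    """
    kept = [h for h, m in zip(hidden, keep_mask) if m]
    if not kept:
        raise ValueError("mask removes every token")
    dim = len(kept[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def info_nce(queries, candidates, log_temperature):
    """In-batch InfoNCE loss where candidates[i] is the positive for queries[i].

    The temperature is exp(log_temperature), which keeps it positive;
    this parameterization is an assumption of the sketch.
    """
    tau = math.exp(log_temperature)
    q = [l2_normalize(v) for v in queries]
    c = [l2_normalize(v) for v in candidates]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity to every candidate in the batch, scaled by 1/tau.
        logits = [sum(a * b for a, b in zip(qi, cj)) / tau for cj in c]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)
```

For example, `masked_mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])` averages only the first two token vectors, and `info_nce` is lower when each query is paired with its matching candidate than when the pairs are shuffled.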

Abstract

Universal multimodal retrieval (UMR), which aims to address complex retrieval tasks where both queries and candidates span diverse modalities, has been significantly advanced by the emergence of MLLMs. While state-of-the-art MLLM-based methods in the literature predominantly adopt contrastive learning principles, they often differ in their specific training recipes. Despite their success, the mechanisms underlying their retrieval capabilities remain largely unexplored, potentially resulting in suboptimal performance and limited generalization ability. To address these issues, we present a comprehensive study aimed at uncovering the key factors that drive effective embedding learning for UMR using MLLMs. We begin by implementing a general MLLM-based embedding learning pipeline, and systematically analyze the primary contributors to high-performing universal retrieval systems. Based on this, we explore various aspects of the details in embedding generation and training strategies, including progressive transition, hard negative mining and re-ranker distillation. Notably, our findings reveal that often-overlooked factors can have a substantial impact on model performance. Building on these discoveries, we introduce a unified framework termed U-MARVEL (Universal MultimodAl RetrieVal via Embedding Learning), which outperforms state-of-the-art competitors on the M-BEIR benchmark by a large margin in supervised settings, and also exhibits strong zero-shot performance on several tasks such as composed image retrieval and text-to-video retrieval. These results underscore the generalization potential of our framework across various embedding-based retrieval tasks. Code is available at https://github.com/chaxjli/U-MARVEL
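As a rough illustration of re-ranker distillation in general, the sketch below matches a student's similarity scores to a teacher's re-ranking scores over one query's candidate list by minimizing a KL divergence between their softmax distributions. This is a generic soft-label formulation, not necessarily the variant used in U-MARVEL; the names `log_softmax` and `distill_kl` and the shared `temperature` argument are assumptions made for this example.

```python
import math

def log_softmax(scores, temperature=1.0):
    """Log of the softmax over a list of scores, computed stably."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]

def distill_kl(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over one query's candidates.

    teacher_scores: relevance scores from a re-ranker (the teacher).
    student_scores: embedding similarities from the retriever (the student).
    """
    log_p = log_softmax(teacher_scores, temperature)
    log_q = log_softmax(student_scores, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

The loss is zero when the student ranks and weights the candidates exactly as the teacher does, and it is unaffected by adding a constant to all of the student's scores, since softmax is shift-invariant.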

Paper Structure

This paper contains 42 sections, 8 equations, 2 figures, 18 tables.

Figures (2)

  • Figure 1: U-MARVEL Exploration: We conduct a comprehensive study on effective design principles for embedding model architectures and optimal strategies for training embedding models.
  • Figure 2: Distillation Illustration. It shows how the teacher model generates scores, and compares the traditional and improved distillation methods.