Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng

TL;DR

This work tackles CLIP's limitations in token capacity, cross-modal fusion, and compositional reasoning by proposing UniME, a two-stage framework that leverages Multimodal Large Language Models. The first stage performs textual discriminative knowledge distillation from a strong LLM-based teacher to enhance the language component's embeddings, while the second stage uses hard negative enhanced instruction tuning, combining false-negative filtering with hard negative sampling, to boost discrimination and instruction-following. Experiments on the MMEB benchmark and several retrieval tasks show consistent improvements across short and long caption retrieval, compositional retrieval, and general multimodal retrieval, indicating stronger discriminative power and compositional understanding. The approach targets vision-language tasks that require detailed descriptions and robust cross-modal reasoning.

Abstract

The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored. In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we first mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities.
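
To make the first stage more concrete, here is a minimal sketch of one common way to distill discriminative knowledge between embedding models: match the student's in-batch similarity distribution to the teacher's with a KL divergence. The abstract does not state the loss, so the KL formulation, the function names, and the `temperature` value are all assumptions made for illustration, not the authors' implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_rows(embeddings, temperature):
    """Row-wise softmax over pairwise cosine similarities within one batch."""
    unit = [l2_normalize(e) for e in embeddings]
    rows = []
    for a in unit:
        logits = [sum(x * y for x, y in zip(a, b)) / temperature for b in unit]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def distillation_loss(student, teacher, temperature=0.05):
    """Mean KL(teacher || student) between in-batch similarity distributions.

    Assumed formulation: only the similarity structure is compared, so the
    student and teacher embeddings may have different dimensions.
    """
    p_teacher = similarity_rows(teacher, temperature)
    p_student = similarity_rows(student, temperature)
    loss = 0.0
    for pt, ps in zip(p_teacher, p_student):
        loss += sum(t * math.log(t / s) for t, s in zip(pt, ps) if t > 0.0)
    return loss / len(student)

# Toy batch of three texts: the teacher uses 3-d embeddings, the student 2-d.
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
student = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
print(distillation_loss(student, teacher))
```

Comparing similarity distributions rather than raw vectors is a design choice of this sketch: it avoids needing a projection layer when the teacher and student embedding sizes differ.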

Paper Structure

This paper contains 42 sections, 4 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: The framework of the Textual Discriminative Knowledge Distillation stage. We leverage the state-of-the-art LLM-based embedding model to enhance the discriminative capabilities of the MLLM's language component.
  • Figure 2: The framework of the Hard Negative Enhanced Instruction Tuning stage. We further improve the discriminative capabilities of the MLLM through false negative filtering and hard negative sampling.
  • Figure 3: Comparison of discriminative ability between E5-V and UniME$^\dagger$. $^\dagger$ denotes the UniME model trained only with the first stage, textual discriminative knowledge distillation.
  • Figure 4: Comparison of training loss and pre-clipping gradient norms for hard negatives, easy negatives, and randomly sampled negatives.
  • Figure 5: Visualization of the top-k next predicted tokens before and after different training stages based on Phi3.5-V. $^\dagger$: UniME with textual discriminative knowledge distillation only. $^\ddagger$: UniME with both textual discriminative knowledge distillation and hard negative enhanced instruction tuning.
  • ...and 2 more figures
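
The two operations named in the Figure 2 caption, false negative filtering and hard negative sampling, can be sketched for a single query as follows. This is a minimal illustration under stated assumptions: the filtering rule (drop candidates scoring above the positive's similarity plus a `margin`), the values of `margin`, `k`, and `temperature`, and the InfoNCE-style loss are hypothetical choices, since the text shown here does not specify them.

```python
import math

def select_hard_negatives(sims, positive_idx, k, margin=0.1):
    """Return the indices of the k hardest negatives that survive filtering.

    sims: similarity of one query to every candidate in the batch.
    Candidates scoring above sims[positive_idx] + margin are treated as likely
    false negatives and dropped (assumed rule, hypothetical margin).
    """
    threshold = sims[positive_idx] + margin
    kept = [i for i, s in enumerate(sims) if i != positive_idx and s <= threshold]
    kept.sort(key=lambda i: sims[i], reverse=True)  # most similar first
    return kept[:k]

def info_nce(sims, positive_idx, negative_idxs, temperature=0.05):
    """Contrastive loss for one query against its positive and chosen negatives."""
    logits = [sims[positive_idx] / temperature]
    logits += [sims[i] / temperature for i in negative_idxs]
    peak = max(logits)  # log-sum-exp with max subtraction for stability
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_denominator - logits[0]

# Toy similarities for one query; candidate 0 is the labeled positive.
sims = [0.62, 0.95, 0.58, 0.10, 0.55, -0.20]
hard = select_hard_negatives(sims, positive_idx=0, k=2)
print(hard)
print(info_nce(sims, 0, hard), info_nce(sims, 0, [3, 5]))
```

In this toy example candidate 1 scores well above the positive and is filtered out as a probable false negative, and the two remaining candidates closest to the query are kept. The selected hard negatives yield a larger loss than the two easiest candidates, which is the mechanism that pushes the model to focus on challenging samples.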