mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data

Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, Zhicheng Dou

TL;DR

mmE5 tackles data scarcity in multimodal multilingual embeddings by introducing a principled synthetic data framework guided by three criteria: broad scope, robust cross-modal alignment, and high fidelity. The authors use a single-pass deep-thinking generation process with a multimodal LLM to produce 560K high-quality samples across 93 languages and seven modality combinations, then fine-tune Llama-3.2-Vision with LoRA using an InfoNCE objective. Results show state-of-the-art performance on MMEB in both zero-shot and supervised settings and strong multilingual retrieval on XTD, with demonstrated transferability to other base MLLMs. The work highlights data efficiency, cross-modal coherence, and multilingual generalization, while acknowledging its reliance on GPT-4o and the cost of scaling synthetic data. Overall, mmE5 provides a scalable blueprint for converting large language models into high-quality multimodal embedding models across languages and tasks, a meaningful step toward universal multimodal representations.
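To make the InfoNCE objective mentioned above concrete, here is a minimal pure-Python sketch of a contrastive loss over query and target embeddings. It is an illustration under stated assumptions, not the authors' training code: the cosine similarity, the use of in-batch negatives only, the temperature of 0.05, and the function names `l2_normalize` and `info_nce` are all choices made for this example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, targets, temperature=0.05):
    """Mean InfoNCE loss where targets[i] is the positive for queries[i].

    Every other target in the batch acts as a negative (an assumption of
    this sketch; the temperature value is also hypothetical).
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    losses = []
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity of query i to every target.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability assigned to the matching target.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D embeddings: correctly paired vs. deliberately mismatched batches.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is close to zero when each query sits next to its own target and large when the pairs are swapped, which is the behavior a contrastive objective relies on to pull matching query-target pairs together in the shared embedding space.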

Abstract

Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, limited labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck. In this work, we identify three criteria for high-quality synthetic multimodal data. First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios. Second, robust cross-modal alignment makes different modalities semantically consistent. Third, high fidelity ensures that the synthetic data maintains realistic details to enhance its reliability. Guided by these principles, we synthesize datasets that: (1) cover a wide range of tasks, modality combinations, and languages, (2) are generated via a deep thinking process within a single pass of a multimodal large language model, and (3) incorporate real-world images with accurate and relevant texts, ensuring fidelity through self-evaluation and refinement. Leveraging these high-quality synthetic and labeled datasets, we train a multimodal multilingual E5 model, mmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art performance on the MMEB benchmark and superior multilingual performance on the XTD benchmark. Our code, datasets, and models are released at https://github.com/haon-chen/mmE5.

Paper Structure

This paper contains 25 sections, 1 equation, 9 figures, 7 tables.

Figures (9)

  • Figure 1: An illustration of our data synthesis framework. "X$\rightarrow$Y" denotes a modality combination, where "X" represents the query side and "Y" denotes the target side. "T" denotes text and "I" denotes image.
  • Figure 2: An illustration of our method. We take the generation of an IT$\rightarrow$IT retrieval data sample as an example.
  • Figure 3: Distribution of languages in the synthetic data.
  • Figure 4: The impact of synthetic data size on multimodal embedding performance on MMEB.
  • Figure 5: The zero-shot performance of mmE5 with different training settings on MMEB (using 280K synthetic samples for efficient testing).
  • ...and 4 more figures