mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data

Haonan Chen; Liang Wang; Nan Yang; Yutao Zhu; Ziliang Zhao; Furu Wei; Zhicheng Dou

TL;DR

mmE5 tackles data scarcity in multimodal multilingual embeddings by introducing a principled synthetic data framework guided by three criteria: broad scope, robust cross-modal alignment, and high fidelity. The authors implement a single-pass deep-thinking generation process with a multimodal LLM to produce 560K high-quality samples across 93 languages and seven modality combinations, then finetune Llama-3.2-Vision with LoRA using an InfoNCE objective. Results show state-of-the-art performance on MMEB in both zero-shot and supervised settings and strong multilingual retrieval on XTD, with demonstrated transferability to other base MLLMs. The work highlights data efficiency, cross-modal coherence, and multilingual generalization, while acknowledging its reliance on GPT-4o and the cost of scaling synthetic data. Overall, mmE5 provides a scalable blueprint for converting large language models into high-quality multimodal embedding models across languages and tasks, a meaningful step toward universal multimodal representations.
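
To make the InfoNCE objective mentioned above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' training code: the function names, the use of in-batch negatives, cosine similarity, and the temperature value of 0.05 are all assumptions chosen for the example.

```python
import math


def l2_normalize(vector):
    """Scale a nonzero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce_loss(queries, targets, temperature=0.05):
    """Mean InfoNCE loss over a batch of embedding pairs.

    Assumption: targets[i] is the positive for queries[i], and every other
    target in the batch acts as a negative (in-batch negatives).
    The temperature default is illustrative, not taken from the paper.
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, query in enumerate(q):
        # Temperature-scaled cosine similarity of this query to every target.
        logits = [sum(a * b for a, b in zip(query, target)) / temperature
                  for target in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in logits))
        # Negative log-probability assigned to the positive pair.
        total += log_denominator - logits[i]
    return total / len(q)
```

The loss is small when each query is closest to its own target and grows when the pairs are mismatched, which is the behavior that pulls matching text and image embeddings together during finetuning.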

Abstract

Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, limited labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck. In this work, we identify three criteria for high-quality synthetic multimodal data. First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios. Second, robust cross-modal alignment makes different modalities semantically consistent. Third, high fidelity ensures that the synthetic data maintains realistic details to enhance its reliability. Guided by these principles, we synthesize datasets that: (1) cover a wide range of tasks, modality combinations, and languages, (2) are generated via a deep thinking process within a single pass of a multimodal large language model, and (3) incorporate real-world images with accurate and relevant texts, ensuring fidelity through self-evaluation and refinement. Leveraging these high-quality synthetic and labeled datasets, we train a multimodal multilingual E5 model, mmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art performance on the MMEB benchmark and superior multilingual performance on the XTD benchmark. Our code, datasets, and models are released at https://github.com/haon-chen/mmE5.
