MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, Yongping Xiong

TL;DR

MegaPairs tackles the data bottleneck in universal multimodal retrieval by generating a massive, diverse set of instruction-bearing image pairs from open-domain images using heterogeneous similarity signals. The resulting MegaPairs dataset enables MMRet, available in CLIP-based and MLLM-based forms, to achieve state-of-the-art zero-shot performance on multiple composed image retrieval (CIR) benchmarks and the MMEB suite, often with far less training data than prior methods. Through a combination of scalable data synthesis, hard negatives, and multimodal contrastive learning, the approach demonstrates strong generalization and downstream fine-tuning gains, with publicly released assets to accelerate the field. This work highlights a practical path to continuously improving retrieval systems without relying on privately curated datasets.

Abstract

Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synthesis method that leverages vision language models (VLMs) and open-domain images, together with a massive synthetic dataset generated from this method. Our empirical analysis shows that MegaPairs generates high-quality data, enabling the multimodal retriever to significantly outperform the baseline model trained on 70$\times$ more data from existing datasets. Moreover, since MegaPairs solely relies on general image corpora and open-source VLMs, it can be easily scaled up, enabling continuous improvements in retrieval performance. In this stage, we produced more than 26 million training instances and trained several models of varying sizes using this data. These new models achieve state-of-the-art zero-shot performance across 4 popular composed image retrieval (CIR) benchmarks and the highest overall performance on the 36 datasets provided by MMEB. They also demonstrate notable performance improvements with additional downstream fine-tuning. Our produced dataset, well-trained models, and data synthesis pipeline will be made publicly available to facilitate the future development of this field.

Paper Structure

This paper contains 39 sections, 4 equations, 6 figures, and 10 tables.

Figures (6)

  • Figure 1: Construction pipeline of multimodal triplets: (a) mining of image pairs, (b) generation of open-ended instructions. Multiple similarity models are used to introduce diversified correlations for the image pairs.
  • Figure 2: Performance scaling of MMRet-base on MegaPairs as data size increases. The dashed lines indicate the performance of MagicLens-B (CLIP) trained on its own dataset of 36.7M data pairs.
  • Figure 3: The specific prompts for the MLLM. In our practical data generation, the value of WORD_NUM ranges from 60 to 100 to enhance the diversity of the generated descriptions.
  • Figure 4: The specific prompts for the LLM. The figure shows two demonstrations; in our practical data generation, five demonstrations are randomly selected from a pool of 50 and fed into the LLM.
  • Figure 5: The visualized examples of MegaPairs. Each row represents a single example, with the query item highlighted in a blue rectangle and the target items enclosed within a dashed box.
  • ...and 1 more figure
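
The captions of Figures 3 and 4 mention two sources of prompt diversity: a WORD_NUM value between 60 and 100 for the MLLM prompt, and five demonstrations drawn at random from a pool of 50 for the LLM prompt. A minimal sketch of that sampling step follows. The pool contents, the function names, the prompt layout, uniform sampling of WORD_NUM, and sampling without replacement are all assumptions made for illustration; the actual prompts are the ones shown in the figures.

```python
import random

# Hypothetical placeholders: the real demonstration pool and prompt wording
# are those shown in Figures 3 and 4, which are not reproduced here.
DEMO_POOL = [f"<demonstration {i}>" for i in range(50)]

WORD_NUM_MIN, WORD_NUM_MAX = 60, 100  # range stated in the Figure 3 caption
DEMOS_PER_PROMPT = 5                  # count stated in the Figure 4 caption


def sample_word_num(rng):
    """Pick a WORD_NUM value for one MLLM prompt.

    Uniform sampling over the inclusive range is an assumption.
    """
    return rng.randint(WORD_NUM_MIN, WORD_NUM_MAX)


def sample_demonstrations(rng, pool=DEMO_POOL, k=DEMOS_PER_PROMPT):
    """Pick k distinct demonstrations from the pool for one LLM prompt.

    Sampling without replacement is an assumption.
    """
    return rng.sample(pool, k)


def build_llm_prompt(rng, description):
    """Assemble an illustrative few-shot prompt (hypothetical layout):
    the sampled demonstrations first, then the new input."""
    demos = sample_demonstrations(rng)
    return "\n\n".join(demos + [description])


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    print("WORD_NUM =", sample_word_num(rng))
    print(build_llm_prompt(rng, "<description of an image pair>"))
```

Drawing a fresh subset of demonstrations and a fresh WORD_NUM for every call means no two prompts need share the same context, which is one plausible way to obtain the diversity the captions describe.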