MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model

Geonmo Gu, Byeongho Heo, Jaemyung Yu, Jaehui Hwang, Taekyung Kim, Sangmin Lee, HeeJae Jun, Yoohoon Kang, Sangdoo Yun, Dongyoon Han

TL;DR

MuCo reframes multimodal embedding learning as a multi-turn, dialogue-like contrastive process in which a single forward pass produces multiple related embeddings conditioned on a shared image context. By concatenating multiple query–target pairs per image and extracting several embeddings with dedicated prompt tokens, MuCo multiplies the effective batch size by the number of pairs per image at a modest increase in compute, addressing both contextual coherence and scalability. Pretraining on the 5M-scale M3T dataset, followed by a guided in-context reconstruction fine-tuning strategy, yields state-of-the-art results on MMEB and M-BEIR across model scales, while ablations demonstrate the importance of compounded supervision, logit masking, and data composition. The approach improves both retrieval performance and training efficiency, offering a practical route to robust, context-aware universal multimodal embeddings.

Abstract

Universal multimodal embedding models built on Multimodal Large Language Models (MLLMs) have traditionally employed contrastive learning, which aligns representations of query-target pairs across different modalities. Yet, despite their empirical success, these models are primarily built on a "single-turn" formulation in which each query-target pair is treated as an independent data point. This paradigm leads to computational inefficiency when scaling, as it requires a separate forward pass for each pair, and it overlooks potential contextual relationships between multiple queries that relate to the same context. In this work, we introduce Multi-Turn Contrastive Learning (MuCo), a dialogue-inspired framework that revisits this process. MuCo leverages the conversational nature of MLLMs to process multiple related query-target pairs associated with a single image within a single forward pass. This allows us to extract a set of multiple query and target embeddings simultaneously, conditioned on a shared context representation, amplifying the effective batch size and overall training efficiency. Experiments show that MuCo, trained with a newly curated 5M multimodal multi-turn dataset (M3T), yields state-of-the-art retrieval performance on the MMEB and M-BEIR benchmarks, while markedly enhancing both training efficiency and representation coherence across modalities. Code and M3T are available at https://github.com/naver-ai/muco

Paper Structure

This paper contains 18 sections, 7 equations, 9 figures, 18 tables.

Figures (9)

  • Figure 1: Comparison of conventional single-turn vs. our multi-turn contrastive learning. (a) Conventional contrastive learning employs a single query-target pair per image, using negative targets from other images to learn discriminative representations. (b) Our multi-turn contrastive learning (MuCo) generalizes this paradigm by using multiple query-target pairs per image, so the negative set expands to include every target from the other images, which enables the model to learn more discriminative embeddings. Notably, for the same number of encoder forward passes, MuCo provides a $k$-times larger effective batch size than conventional contrastive learning.
  • Figure 2: Overview of MuCo. With a multiple query-target paired dataset, MuCo intuitively structures the input by sequentially arranging the pairs as distinct dialogue turns. Lines are drawn between extracted embeddings, where blue lines denote a positive pair and orange lines denote negative pairs. For clarity, lines are shown originating only from the earliest turn of the first sample in the batch. Solid lines represent the pair set used by conventional methods, while dotted lines represent the augmented pairs (i.e., more learning signals) contributed by the MuCo framework. For visual clarity, we omit the embedding function notation (e.g., $f(\cdot)$).
  • Figure 3: Logit masking strategy in our MuCo framework. (a) The conventional method with a batch size of $N=4$ yields an $N \times N$ (i.e., $4 \times 4$) matrix. (b) In contrast, MuCo uses a batch size of $N=2$ and $k=4$ turns to construct a larger $Nk \times Nk$ (i.e., $8 \times 8$) matrix. True positives (blue, $+$) and true negatives (orange, $-$) are used for the loss. Crucially, other pairs originating from the same image (gray, $-\infty$) are masked to prevent a semantic overlap issue.
  • Figure 4: Multi-turn template for fine-tuning MuCo on single-pair datasets. We illustrate Query (left) and Positive (right) templates. The initial query (cyan) is reused as a masked target on the Positive side, and the positive target (pink) becomes a masked target on the Query side. This process simulates multi-turn interactions from a single pair, guiding the model to reconstruct its counterpart and enrich the learned embeddings.
  • Figure A: Overview of the M3T Dataset Synthesis Pipeline. The data synthesis proceeds in two stages. First, an MLLM (Qwen2.5-VL-72B) is used to generate a comprehensive, objective dense caption of an image's rich visual information. Second, an LLM (OSS-20B) takes the dense caption as input to synthesize a diverse set of text-only query-positive pairs aligned with seven meta-tasks. Specifically, these include one classification pair, one retrieval pair, and five distinct VQA pairs (two global VQA, two local VQA, and one creative VQA). For the retrieval task, the dense caption itself serves as the positive target, which is abbreviated with "..." for clarity in the figure.
  • ...and 4 more figures
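
The logit masking described in the Figure 3 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `muco_masked_infonce`, the temperature value, the one-directional (query-to-target) InfoNCE form, and the assumption that the $k$ turns of each image occupy consecutive rows are all choices made here for illustration only.

```python
import numpy as np

def muco_masked_infonce(q, t, k, temperature=0.05):
    """Masked contrastive loss over an (N*k) x (N*k) logit matrix.

    q, t: arrays of shape (N*k, d) holding query and target embeddings.
          Assumption: rows i*k .. i*k + k - 1 all belong to image i.
    k:    number of turns (query-target pairs) per image.
    temperature: hypothetical value, not taken from the paper.
    Returns the mean loss and the boolean mask of suppressed entries.
    """
    nk = q.shape[0]
    assert nk % k == 0, "row count must be a multiple of k"

    logits = (q @ t.T) / temperature              # (Nk, Nk) similarity logits
    image_id = np.arange(nk) // k                 # image index of every row
    same_image = image_id[:, None] == image_id[None, :]
    positive = np.eye(nk, dtype=bool)             # matching query-target turn

    # Mask pairs from the same image that are not the true positive (-inf),
    # so they act neither as positives nor as negatives.
    mask = same_image & ~positive
    logits = np.where(mask, -np.inf, logits)

    # Row-wise log-softmax, numerically stabilised by subtracting the row max.
    row_max = logits.max(axis=1, keepdims=True)
    log_z = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_z

    loss = -np.mean(np.diag(log_prob))
    return loss, mask

# Setting from the Figure 3 caption: N = 2 images, k = 4 turns, 8 x 8 matrix.
N, k, d = 2, 4, 16
rng = np.random.default_rng(0)
q = rng.normal(size=(N * k, d))
t = rng.normal(size=(N * k, d))
q /= np.linalg.norm(q, axis=1, keepdims=True)
t /= np.linalg.norm(t, axis=1, keepdims=True)
loss, mask = muco_masked_infonce(q, t, k)
```

In this setting each row keeps one positive, masks the other $k-1 = 3$ entries from the same image, and retains $(N-1)k = 4$ negatives from the other image, which accounts for all 8 columns of the row.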