ReMatch: Boosting Representation through Matching for Multimodal Retrieval

Qianying Liu, Xiao Liang, Zhiqiang Zhang, Zhongfei Qing, Fengfan Zhou, Yibo Chen, Xu Tang, Yao Hu, Paul Henderson

TL;DR

ReMatch addresses the limitation of treating multimodal large language models as mere encoders by leveraging their generative capabilities for embedding learning. It introduces a chat-style multimodal matching objective and learnable multi-token augmentation to produce fine-grained, orthogonally diverse embeddings, coupled with an efficient multi-view attention scheme. The method jointly optimizes a contrastive objective and a discriminative matching loss, achieving state-of-the-art results on MMEB and strong zero-shot transfer across diverse retrieval tasks. This work demonstrates that integrating generative reasoning with discriminative embedding learning can yield robust, transferable multimodal representations suitable for retrieval and beyond.

Abstract

We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature and under-utilising its compositional reasoning and world knowledge. We instead train the embedding MLLM end-to-end with a chat-style generative matching stage. The matching stage uses the same MLLM to autoregressively decide relevance from multi-view inputs, including both raw data and its own projected embeddings for each query and document. It provides instance-wise discrimination supervision that complements a standard contrastive loss, offering stronger gradients on hard negatives and preserving the compositional strengths of the original MLLM. To obtain semantically richer multimodal embeddings, we use multiple learnable tokens to augment each input, generating fine-grained, contextual, mutually orthogonal embeddings with low inference cost. Building on our established high-performance baseline, we assemble these ideas into a training recipe that achieves a new state of the art on the Massive Multimodal Embedding Benchmark (MMEB). Our experiments show particularly strong zero-shot generalization results on five datasets, highlighting the robustness and transferability of ReMatch.
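The joint objective described in the abstract, a contrastive loss complemented by a matching loss, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an in-batch InfoNCE loss for the contrastive term and reduces the autoregressive yes/no decision to a single relevance logit with binary cross-entropy. The function names, the `temperature` value, and the `weight` that balances the two terms are all hypothetical.

```python
import math


def info_nce(sim, temperature=0.05):
    """Mean InfoNCE loss over a square similarity matrix.

    sim[i][j] is the similarity of query i and document j;
    the diagonal entries are the positive pairs.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)


def matching_loss(yes_logits, labels):
    """Binary cross-entropy on a relevance logit.

    Assumption: stands in for the next-token loss on the answer token
    of the chat-style matching stage (label 1 = match, 0 = no match).
    """
    total = 0.0
    for z, y in zip(yes_logits, labels):
        # softplus(z) - y * z is the stable form of BCE with logits
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += softplus - y * z
    return total / len(labels)


def joint_loss(sim, yes_logits, labels, weight=1.0, temperature=0.05):
    """Contrastive term plus weighted matching term (weight is hypothetical)."""
    return info_nce(sim, temperature) + weight * matching_loss(yes_logits, labels)
```

In this simplified form the matching term scores each query-document pair individually, which is the instance-wise signal the abstract credits with stronger gradients on hard negatives, while the contrastive term only compares a positive against the other documents in the batch.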

Paper Structure

This paper contains 35 sections, 5 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Illustration of our motivation. (a) Previous embedding methods like VLM2Vec-V2 disrupt the inherent fine-grained grounding of pretrained MLLMs. In contrast, our approach effectively preserves this critical alignment. (b) Our framework combines multiple learnable tokens with a generative matching objective to produce fine-grained and discriminative embeddings.
  • Figure 2: Previous multimodal retrieval frameworks vs. our ReMatch. Upper left: a single-token retrieval method outputs an embedding for the query and the document of each pair, taken at the [EOS] position, and uses a contrastive objective to maximize the similarity of corresponding pairs. Upper right: our framework first augments the input with learnable tokens and obtains multi-vector representations at the learnable-token positions. Orthogonal regularization is applied to these representations, which are then fused into one embedding per query or document and optimized with a contrastive objective. An MLP projector adapts the output embeddings to the MLLM input distribution, where they are used by our matching loss. Lower: our query-document matching strategy adds point-wise discriminative signals to the framework from both the original-input and feature perspectives.
  • Figure 3: We introduce a unified attention mask that enables the model to process all eight raw/embedding query–document combinations in one forward pass. Each answer token attends only to its paired query–document inputs (both raw and embedded) and the instruction prompt, preserving standard next-token prediction behavior. Randomizing which of $d^1, d^2$ holds $d^+$ prevents any positional leakage of relevance signals.
  • Figure 4: Qualitative Comparison of Visual Grounding Results on the Visual7W-Pointing (OOD) Dataset.
  • Figure 5: VQA retrieval results on Place365 (OOD) and ScienceQA (OOD), comparing our Exp6 with the baseline.
  • ...and 7 more figures
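
The multi-token step in the Figure 2 caption (orthogonal regularization over the learnable-token representations, followed by fusion into a single embedding) can be sketched as follows. This is an illustration under stated assumptions, not the paper's formulation: the penalty is taken to be the mean squared cosine similarity between distinct token embeddings, and the fusion operator is assumed to be mean pooling followed by L2 normalization. The function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def orthogonality_penalty(token_embeddings):
    """Mean squared cosine similarity over distinct pairs of token embeddings.

    Assumption: one common way to encourage mutually orthogonal vectors;
    the penalty is 0 when all pairs are orthogonal.
    """
    unit = [l2_normalize(v) for v in token_embeddings]
    total, pairs = 0.0, 0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            total += cos * cos
            pairs += 1
    return total / pairs if pairs else 0.0


def fuse(token_embeddings):
    """Fuse token embeddings into one vector.

    Assumption: mean pooling followed by L2 normalization; the source
    does not specify the fusion operator.
    """
    count = len(token_embeddings)
    dim = len(token_embeddings[0])
    mean = [sum(v[d] for v in token_embeddings) / count for d in range(dim)]
    return l2_normalize(mean)
```

Under these assumptions the penalty is added to the training loss only, so retrieval still uses a single fused vector per query or document, in line with the low inference cost claimed in the abstract.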