ReMatch: Boosting Representation through Matching for Multimodal Retrieval
Qianying Liu, Xiao Liang, Zhiqiang Zhang, Zhongfei Qing, Fengfan Zhou, Yibo Chen, Xu Tang, Yao Hu, Paul Henderson
TL;DR
ReMatch addresses the limitation of treating multimodal large language models as mere encoders by leveraging their generative capabilities for embedding learning. It introduces a chat-style multimodal matching objective and learnable multi-token augmentation to produce fine-grained, mutually orthogonal embeddings, coupled with an efficient multi-view attention scheme. The method jointly optimizes a contrastive objective and a discriminative matching loss, achieving state-of-the-art results on MMEB and strong zero-shot transfer across diverse retrieval tasks. This work demonstrates that integrating generative reasoning with discriminative embedding learning can yield robust, transferable multimodal representations suitable for retrieval and beyond.
Abstract
We present ReMatch, a framework that leverages the generative strength of multimodal large language models (MLLMs) for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature and under-utilising its compositional reasoning and world knowledge. We instead train the embedding MLLM end-to-end with a chat-style generative matching stage. The matching stage uses the same MLLM to autoregressively decide relevance from multi-view inputs, including both the raw data and its own projected embeddings for each query and document. It provides instance-wise discrimination supervision that complements a standard contrastive loss, offering stronger gradients on hard negatives and preserving the compositional strengths of the original MLLM. To obtain semantically richer multimodal embeddings, we use multiple learnable tokens to augment each input, generating fine-grained contextual, mutually orthogonal embeddings at low inference cost. Building on our established high-performance baseline, we assemble these ideas into a powerful training recipe and achieve a new state of the art on the Massive Multimodal Embedding Benchmark (MMEB). Our experiments show particularly strong zero-shot generalization on five datasets, highlighting the robustness and transferability of ReMatch.
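The pairing of a contrastive loss with an instance-wise matching loss can be illustrated with a small, framework-free sketch. This is not the authors' implementation: the in-batch InfoNCE form, the temperature value, the reduction of the chat-style decision to a pair of "yes"/"no" logits, and the weight `lambda_match` are all assumptions chosen for illustration, and every function name below is hypothetical.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(queries, docs, temperature=0.05):
    """In-batch InfoNCE (assumed form): docs[i] is the positive for queries[i],
    and every other document in the batch acts as a negative."""
    q = [_normalize(x) for x in queries]
    d = [_normalize(x) for x in docs]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [_dot(qi, dj) / temperature for dj in d]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


def matching_loss(yes_logits, no_logits, labels):
    """Cross-entropy over a two-way relevance decision per (query, doc) pair.
    Assumption: the matching stage's answer is scored with one logit for a
    'yes' token and one for a 'no' token; label 1 means the pair is relevant."""
    total = 0.0
    for yes, no, label in zip(yes_logits, no_logits, labels):
        peak = max(yes, no)
        log_z = peak + math.log(math.exp(yes - peak) + math.exp(no - peak))
        total += log_z - (yes if label else no)
    return total / len(labels)


def joint_loss(queries, docs, yes_logits, no_logits, labels, lambda_match=1.0):
    """Weighted sum of the two objectives; lambda_match is a hypothetical weight."""
    return (contrastive_loss(queries, docs)
            + lambda_match * matching_loss(yes_logits, no_logits, labels))


if __name__ == "__main__":
    # Toy batch: two queries whose positives are the same-index documents.
    queries = [[1.0, 0.0], [0.0, 1.0]]
    docs = [[0.9, 0.1], [0.2, 0.8]]
    # One positive pair and one hard negative scored by the matching stage.
    print(joint_loss(queries, docs,
                     yes_logits=[2.0, 0.5], no_logits=[-1.0, 0.3],
                     labels=[1, 0]))
```

In this sketch the contrastive term compares each query against the whole batch, while the matching term scores one pair at a time, so a hard negative that the batch-level softmax barely penalizes still contributes its own per-pair gradient.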
