Think Then Embed: Generative Context Improves Multimodal Embedding

Xuanming Cui, Jianpeng Cheng, Hong-you Chen, Satya Narayan Shukla, Abhijeet Awasthi, Xichen Pan, Chaitanya Ahuja, Shlok Kumar Mishra, Yonghuan Yang, Jun Xiao, Qi Guo, Ser-Nam Lim, Aashu Singh, Xiangjun Fan

TL;DR

The paper tackles the challenge of instruction-aware universal multimodal embeddings by moving beyond encoder-only usage of multimodal LLMs. It introduces Think-Then-Embed (TTE), a two-stage framework where a reasoner generates Embedding-Centric Reasoning traces that condition an embedder to produce task-specific embeddings, inspired by chain-of-thought reasoning. The authors demonstrate state-of-the-art MMEB-V2 results with a large teacher reasoner, show strong open-source performance via finetuned smaller reasoners, and propose unified reasoner-embedder architectures with pluggable embedding heads to improve efficiency. Across MMEB-V1 and MMEB-V2, TTE consistently improves retrieval, VQA, and grounding tasks without extra data, highlighting the practical impact of integrating explicit reasoning into multimodal embedding learning. Overall, the work validates that generative context can meaningfully enhance embedding quality and efficiency, enabling more flexible, instruction-aware multimodal retrieval systems.

Abstract

There is a growing interest in Universal Multimodal Embeddings (UME), where models are required to generate task-specific representations. While recent studies show that Multimodal Large Language Models (MLLMs) perform well on such tasks, they treat MLLMs solely as encoders, overlooking their generative capacity. However, such an encoding paradigm becomes less effective as instructions become more complex and require compositional reasoning. Inspired by the proven effectiveness of chain-of-thought reasoning, we propose a general Think-Then-Embed (TTE) framework for UME, composed of a reasoner and an embedder. The reasoner MLLM first generates reasoning traces that explain complex queries, followed by an embedder that produces representations conditioned on both the original query and the intermediate reasoning. This explicit reasoning step enables more nuanced understanding of complex multimodal instructions. Our contributions are threefold. First, by leveraging a powerful MLLM reasoner, we achieve state-of-the-art performance on the MMEB-V2 benchmark, surpassing proprietary models trained on massive in-house datasets. Second, to reduce the dependency on large MLLM reasoners, we finetune a smaller MLLM reasoner using high-quality embedding-centric reasoning traces, achieving the best performance among open-source models with a 7% absolute gain over recently proposed models. Third, we investigate strategies for integrating the reasoner and embedder into a unified model for improved efficiency without sacrificing performance.
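The two-stage flow described above (reason first, then embed the query together with the reasoning) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_reasoner` is a canned lookup standing in for the reasoner MLLM, `toy_embed` is a normalized bag-of-words standing in for the embedder, and text-only strings stand in for multimodal inputs. It is not the authors' implementation.

```python
import math
from collections import Counter
from typing import Callable, Dict

Embedding = Dict[str, float]


def toy_reasoner(query: str) -> str:
    """Hypothetical stand-in for the reasoner MLLM: returns a reasoning trace."""
    # A real reasoner would generate this text; a canned lookup keeps the sketch runnable.
    canned = {
        "What animal is shown on the left?":
            "The left region shows a dog; the answer is dog.",
    }
    return canned.get(query, "")


def toy_embed(text: str) -> Embedding:
    """Hypothetical stand-in for the embedder: L2-normalized bag of words."""
    counts = Counter(tok.strip(".,;?!").lower() for tok in text.split())
    counts.pop("", None)  # drop tokens that were pure punctuation
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def think_then_embed(
    query: str,
    reasoner: Callable[[str], str],
    embedder: Callable[[str], Embedding],
) -> Embedding:
    """Embed the query conditioned on both the query and its reasoning trace."""
    trace = reasoner(query)
    return embedder(f"{query} {trace}".strip())


def cosine(a: Embedding, b: Embedding) -> float:
    """Dot product of two already-normalized sparse vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


query = "What animal is shown on the left?"
target = toy_embed("dog")

plain_score = cosine(toy_embed(query), target)  # encoder-only baseline
tte_score = cosine(think_then_embed(query, toy_reasoner, toy_embed), target)
```

In this toy setting the query alone shares no tokens with the target, so `plain_score` is zero, while the reasoning trace introduces the answer term and makes `tte_score` positive. The sketch only shows where the generated context enters the embedding input, not how much a trained model would gain from it.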

Paper Structure

This paper contains 23 sections, 6 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: Given a multimodal input, we first think about the desired embedding content. The representation is then conditioned on both the original input and the thinking result.
  • Figure 2: Pipeline comparison between existing MLLM-based embedding (a) and proposed approach (b, c).
  • Figure 3: Embedding head designs: (a) Attention pooling with a learnable query. (b) NV-Embed-style pooler. (c) QFormer-style embedding head. (d) Embedding head with self-attention blocks initialized from the backbone MLLM. Green denotes trainable components. Q, K, and V denote query, key, and value in the attention mechanism. MHSA refers to Multi-Head Self-Attention, and the light-blue box (hex DAE8FC) denotes the output embedding.
  • Figure 4: Baseline (2B) with/without zero-shot ECR on MMEB-V1.
  • Figure 5: Results of T2T evaluation on generated ECR, versus $\text{TTE}_s$ and VLM2Vec-V1, on MMEB-V1.
  • ...and 4 more figures
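
As a rough illustration of the simplest head named in Figure 3(a), attention pooling with a learnable query, the sketch below pools a sequence of hidden states into one vector. It is a minimal single-head version under stated assumptions: keys and values are the hidden states themselves (no K/V projections), the query is passed in as a plain list rather than learned, and the function name `attention_pool` is hypothetical. The paper's actual head may differ.

```python
import math
from typing import List


def attention_pool(hidden: List[List[float]], query: List[float]) -> List[float]:
    """Pool T hidden states of width d into one d-dimensional vector.

    Assumption: keys and values are the raw hidden states; a trained head
    would typically apply learned projections and learn `query` itself.
    """
    d = len(query)
    # Scaled dot-product score between the query and each hidden state.
    scores = [
        sum(q * h for q, h in zip(query, row)) / math.sqrt(d) for row in hidden
    ]
    # Numerically stable softmax over the sequence dimension.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of hidden states gives the pooled embedding.
    return [sum(w * row[j] for w, row in zip(weights, hidden)) for j in range(d)]


hidden_states = [[1.0, 0.0], [0.0, 1.0]]
uniform = attention_pool(hidden_states, [0.0, 0.0])   # zero query: mean pooling
focused = attention_pool(hidden_states, [10.0, 0.0])  # query aligned with state 0
```

With a zero query every position gets equal weight, so the result reduces to mean pooling; a query aligned with one hidden state concentrates almost all the weight there. Training the query lets the head choose which positions contribute to the output embedding.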