Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings

Haonan Jiang, Yuji Wang, Yongjie Zhu, Xin Lu, Wenyu Qin, Meng Wang, Pengfei Wan, Yansong Tang

TL;DR

Embed-RL tackles the misalignment between generative reasoning and embedding objectives in Universal Multimodal Embeddings by introducing an Embedder-Guided Reinforcement Learning framework that decouples the Embedder and Reasoner. It introduces Evidential Traceability CoT (T-CoT) to capture retrieval-relevant multimodal cues, guided by a three-component reward and optimized with GRPO. Data construction and strict filtering enable high-quality, reasoning-aligned embeddings trained on diverse modalities, achieving state-of-the-art or competitive results on MMEB-V2 and UVRB under limited compute. The work demonstrates that targeted, multimodal reasoning optimization yields substantial gains in cross-modal retrieval and generalization, with practical implications for scalable, reasoning-driven multimodal systems.

Abstract

Leveraging Multimodal Large Language Models (MLLMs) has become pivotal for advancing Universal Multimodal Embeddings (UME) in addressing diverse cross-modal tasks. Recent studies demonstrate that incorporating generative Chain-of-Thought (CoT) reasoning can substantially enhance task-specific representations compared to discriminative methods. However, the generated reasoning CoTs of existing generative embedding methods are limited to the textual analysis of queries and are irrelevant to the retrieval of the targets. To address these limitations, we propose a reasoning-driven UME framework that integrates Embedder-Guided Reinforcement Learning (EG-RL) to optimize the Reasoner to produce evidential Traceability CoT (T-CoT). Our key contributions are threefold: (1) We design an EG-RL framework where the Embedder provides explicit supervision to the Reasoner, ensuring the generated CoT traces are aligned with embedding tasks. (2) We introduce T-CoT, which extracts critical multimodal cues to focus on retrieval-relevant elements and provides multimodal inputs for the Embedder. (3) With limited computational resources, our framework outperforms the pioneering embedding model on both MMEB-V2 and UVRB benchmarks. The integration of multimodal evidence in structured reasoning, paired with retrieval-oriented alignment, effectively strengthens cross-modal semantic consistency and boosts the fine-grained matching capability of the model as well as the generalization across complex scenarios. Our work demonstrates that targeted reasoning optimization can significantly improve multimodal embedding quality, providing a practical and efficient solution for reasoning-driven UME development.
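The TL;DR states that the Reasoner is optimized with GRPO under a three-component reward. As a rough illustration only, the sketch below shows the group-relative advantage normalization commonly associated with GRPO: several T-CoT samples are drawn for the same query, each receives a scalar reward, and rewards are standardized within the group. The three reward components, the equal weights, the use of population standard deviation, and all function names here are hypothetical placeholders, not the paper's actual reward design.

```python
import statistics


def combined_reward(components, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three reward components.

    Assumption: the components and the equal weights are placeholders;
    the source only says the reward has three components.
    """
    return sum(w * c for w, c in zip(weights, components))


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same query.

    Each advantage is (reward - group mean) / (group std + eps), so a sample
    is credited only for being better than its siblings in the group.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std, an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical T-CoT samples for one query, each scored on three
# placeholder components in [0, 1].
samples = [
    (1.0, 0.0, 0.0),
    (1.0, 1.0, 0.0),
    (1.0, 1.0, 1.0),
    (0.0, 1.0, 1.0),
]
rewards = [combined_reward(s) for s in samples]
advantages = group_relative_advantages(rewards)
```

In this toy group, the third sample has the highest combined reward and therefore the largest advantage, while samples at the group mean receive an advantage of zero.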

Paper Structure

This paper contains 37 sections, 7 equations, 14 figures, and 13 tables.

Figures (14)

  • Figure 1: Multimodal embedding optimization via Embedder-Guided Reinforcement Learning (EG-RL). (a) Evolution of frameworks. (b) Reasoning enhancement with RL-optimized evidential Traceability CoT (T-CoT). (c) Comparison of multi-task performance.
  • Figure 2: Overview of the proposed data synthesis and EG-RL framework. (a) Data Construction generates T-CoT annotations for query-positive pairs, filters and splits the dataset to enable contrastive and reinforcement learning, laying the groundwork for reasoning-aware embedding. (b) Embedder-Guided Reinforcement Learning finetunes the MLLM with a process-outcome reward function, encouraging T-CoT trajectories that yield more discriminative and beneficial generative embeddings.
  • Figure 3: Example visualization of our reasoning-driven embedding framework on multimodal retrieval tasks. The figure shows the evidential Traceability CoT reasoning process for video and visual document retrieval.
  • Figure 4: Similarity difference $\Delta s = \text{sim}(\text{query}, \text{top1}) - \text{sim}(\text{query}, \text{top2})$ before and after EG-RL. Here, $\text{sim}(\cdot,\cdot)$ denotes cosine similarity of normalized embeddings, $\text{top1}$ is the most similar positive candidate and $\text{top2}$ the second-most similar. This metric quantifies the model’s discriminative ability over similar candidates on multimodal datasets.
  • Figure 5: Relationship between traceable evidence counts and retrieval metrics across datasets. Hit@1 is reported for Image and Video, and NDCG@5 for VisDoc. Evidence is counted as bounding boxes for Image and VisDoc, and as keyframes for Video.
  • ...and 9 more figures
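
The similarity difference defined in the Figure 4 caption can be computed directly from embeddings. The following is a minimal sketch, assuming embeddings are plain lists of floats and that top1 and top2 are simply the two highest-scoring candidates for the query. The function names and the toy vectors are illustrative and do not come from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_sim(a, b):
    """Cosine similarity, computed as the dot product of normalized vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def similarity_margin(query, candidates):
    """Return sim(query, top1) - sim(query, top2).

    Assumption: top1 and top2 are the two candidates with the highest
    cosine similarity to the query.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates")
    sims = sorted((cosine_sim(query, c) for c in candidates), reverse=True)
    return sims[0] - sims[1]


# Toy 2-D embeddings: the first candidate matches the query exactly,
# the third is a partial match, the second is orthogonal.
query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
margin = similarity_margin(query, candidates)
```

A larger margin means the best candidate is separated more clearly from its closest competitor, which is how the caption frames discriminative ability over similar candidates.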