UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings

Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su

TL;DR

UME-R1 proposes a two-stage approach to unify discriminative and generative multimodal embeddings using reasoning-driven generation. Stage 1 performs supervised fine-tuning with reasoning to enable generation, producing both embedding types, while Stage 2 applies reinforcement learning with verifiable rewards to optimize generative embeddings, including a novel reward design that combines format and embedding quality. Across MMEB-V2’s 78 tasks, generative embeddings yield substantial gains over purely discriminative ones, with the oracle showing additional potential from mode switching. The work demonstrates the feasibility and benefits of reasoning-driven generative multimodal embeddings and opens paths for inference-time scaling and adaptive embedding strategies.
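
The Stage 2 reward is described here only as a combination of format and embedding quality; the exact formula is not given in this summary. The following is a minimal sketch of how such a combined reward could look, under loudly stated assumptions: the `<think>`/`<answer>` tag layout, the equal weighting `w_format=0.5`, the rank-based quality score, and all function names are hypothetical and not taken from the paper.

```python
import math
import re

# ASSUMPTION: outputs place reasoning in <think> and the summary in <answer>.
# This tag layout is hypothetical; the paper's actual format may differ.
_FORMAT = re.compile(r"<think>.+?</think>\s*<answer>.+?</answer>", re.DOTALL)


def format_reward(text: str) -> float:
    """Return 1.0 if the output follows the assumed layout, else 0.0."""
    return 1.0 if _FORMAT.fullmatch(text.strip()) else 0.0


def cosine(a, b) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embedding_reward(query, positive, negatives) -> float:
    """ASSUMPTION: quality is the fraction of negatives ranked below the positive."""
    if not negatives:
        return 1.0
    pos_sim = cosine(query, positive)
    beaten = sum(1 for neg in negatives if pos_sim > cosine(query, neg))
    return beaten / len(negatives)


def total_reward(text, query, positive, negatives, w_format: float = 0.5) -> float:
    """Weighted sum of the format check and the embedding-quality score."""
    quality = embedding_reward(query, positive, negatives)
    return w_format * format_reward(text) + (1.0 - w_format) * quality
```

Both terms can be computed automatically from the model output and the labeled query-target pair, which is the property that makes a reward "verifiable" in the sense used above.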

Abstract

The remarkable success of multimodal large language models (MLLMs) has driven advances in multimodal embeddings, yet existing models remain inherently discriminative, limiting their ability to benefit from the reasoning-driven generation paradigm. In this work, we pioneer the exploration of generative embeddings, unifying embedding tasks within a generative paradigm. We propose UME-R1, a universal multimodal embedding framework consisting of a two-stage training strategy: a cold-start supervised fine-tuning stage equips the model with reasoning capabilities and enables it to generate both discriminative and generative embeddings; a subsequent reinforcement learning stage enhances reasoning and further optimizes generative embedding quality. This pioneering work reveals four key insights: 1) generative embeddings unlock substantial performance gains over conventional discriminative embeddings by leveraging the powerful generative reasoning capabilities of MLLMs; 2) discriminative and generative embeddings are complementary, with their combined oracle performance far exceeding that of either alone; 3) RL can effectively enhance generative embeddings, establishing a scalable optimization paradigm; 4) repeated sampling at inference boosts downstream task coverage (pass@k), highlighting the inference-time scalability potential of generative embeddings. Evaluated on the MMEB-V2 benchmark across 78 tasks spanning video, image, and visual documents, UME-R1 significantly outperforms conventional discriminative embedding models and offers a foundation for more interpretable, reasoning-driven generative multimodal embeddings. Our code, models, and datasets will be publicly available at https://github.com/XMUDeepLIT/UME-R1.
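
The oracle in insight 2 can be read as a per-query upper bound: a query counts as solved if either embedding mode retrieves the correct target. A minimal sketch of that computation follows; the function name and the toy hit lists are assumptions for illustration only and are not results from the paper.

```python
def oracle_accuracy(disc_hits, gen_hits) -> float:
    """Fraction of queries solved by at least one embedding mode.

    disc_hits[i] / gen_hits[i] say whether the discriminative / generative
    embedding retrieved the correct target for query i.
    """
    if len(disc_hits) != len(gen_hits):
        raise ValueError("both modes must be evaluated on the same queries")
    if not disc_hits:
        return 0.0
    solved = sum(1 for d, g in zip(disc_hits, gen_hits) if d or g)
    return solved / len(disc_hits)


# Toy data (made up): each mode alone solves 2 of 4 queries.
disc = [True, False, True, False]
gen = [False, True, True, False]
print(oracle_accuracy(disc, gen))  # 0.75
```

In this toy case each mode alone reaches 0.5 while the oracle reaches 0.75, because the two modes fail on different queries. That gap is what makes mode switching attractive.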

Paper Structure

This paper contains 31 sections, 7 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Illustration of the pipeline for data construction. Specific prompts used for CoT annotation and the resulting data samples are presented in Appendix \ref{appendix:data}.
  • Figure 2: Overview of UME-R1. UME-R1 introduces a two-stage training framework for generative multimodal embedding. (a) Supervised fine-tuning uses query-target pairs with reasoning annotations to train the MLLM, enabling it to generate both discriminative and generative embeddings as well as to possess basic reasoning abilities. (b) RLVR continues to fine-tune the model using regular query-target pairs, encouraging it to generate reasoning trajectories that lead to more beneficial generative embeddings.
  • Figure 3: pass@$k$ curves of UME-2B and UME-7B across multiple datasets.
  • Figure 4: Comparison between DUME, DUME+Gen, and UME-R1. DUME+Gen denotes the approach in which an external model first generates reasoning and summaries, followed by DUME to obtain the corresponding embeddings.
  • Figure 5: Example from the constructed cold-start dataset (Case 1). Orange segments correspond to the original data, blue segments to the added prompts, black segments to the reasoning process, and green segments to the summaries.
  • ...and 11 more figures