
PLUME: Latent Reasoning Based Universal Multimodal Embedding

Chenwei He, Xiangzhao Hao, Tianyu Yang, Yuxiang Ma, Yuheng Jia, Lingxiang Wu, Chaoyang Zhao, Haiyun Guo, Jinqiao Wang

Abstract

Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling multimodal large language models to better infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum that uses verbalized reasoning only as a temporary training scaffold and gradually transfers this behavior into hidden-state computation, eliminating explicit CoT at inference. On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT UME baselines while reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps, delivering over 30x faster inference. PLUME is especially well suited to retrieval settings where relevant evidence is dense, structurally complex, and difficult to organize through verbalized intermediate rationales, such as video and visual document retrieval. These results show that structured latent computation can preserve the benefits of intermediate reasoning without the overhead of explicit rationale generation, providing a stronger and more efficient paradigm for practical retrieval systems.
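To make the rollout idea concrete, below is a minimal PyTorch sketch of replacing CoT decoding with a fixed budget of continuous latent steps. It is illustrative only: the class name LatentRollout, the linear stand-in for the transition adapter, the step budget, and the Hugging Face-style backbone interface (inputs_embeds / last_hidden_state) are all assumptions, not PLUME's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentRollout(nn.Module):
    """Replaces explicit CoT decoding with a fixed budget of continuous latent steps."""

    def __init__(self, backbone: nn.Module, hidden_dim: int, num_steps: int = 8):
        super().__init__()
        self.backbone = backbone      # decoder-style transformer (HF-like interface assumed)
        self.transition = nn.Linear(hidden_dim, hidden_dim)  # stand-in for the transition adapter
        self.num_steps = num_steps    # fixed latent budget; the paper reports fewer than 10 steps

    def forward(self, prefix_embeds: torch.Tensor) -> torch.Tensor:
        # prefix_embeds: (batch, seq_len, hidden_dim), the embedded multimodal prefix
        states = prefix_embeds
        for _ in range(self.num_steps):
            # Re-encoding the full sequence each step keeps the sketch simple;
            # a real implementation would reuse the KV cache instead.
            hidden = self.backbone(inputs_embeds=states).last_hidden_state
            next_state = self.transition(hidden[:, -1:, :])   # one continuous latent step
            states = torch.cat([states, next_state], dim=1)   # autoregressive rollout
        # Read the retrieval embedding off the final latent state (the <gen> position).
        return F.normalize(states[:, -1, :], dim=-1)
```

Because no tokens are sampled and the step count is fixed, inference cost is constant and small, which is the source of the speedup over generating hundreds of CoT tokens.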

Paper Structure

This paper contains 19 sections, 8 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: PLUME achieves a favorable accuracy-efficiency tradeoff on MMEB-v2. The x-axis shows inference throughput on a single H20 GPU and the y-axis shows average MMEB-v2 performance.
  • Figure 2: Comparison of three universal multimodal embedding paradigms. Left: early discriminative UME forms embeddings through single-pass encoding, preserving efficiency but without explicitly modeling intermediate reasoning. Middle: explicit-CoT UME improves reasoning by generating long textual traces before embedding extraction, but incurs substantial inference latency and token cost. Right: PLUME internalizes reasoning into a compact latent rollout and adapts the reasoning path with semantic-anchor-guided expert routing, achieving reasoning-aware embedding with substantially lower inference cost.
  • Figure 3: Overview of PLUME. Starting from a multimodal prefix, PLUME replaces explicit CoT decoding with a compact latent rollout inside the backbone. The bottom panel illustrates the latent rollout process, where the model performs several latent transitions before extracting the final retrieval embedding from the hidden state at <gen>. The top-left panel expands the semantic-anchor-guided transition adapter, which routes each latent step through shared and specialized experts (see the adapter sketch after this list), while the top-right panel shows the progressive explicit-to-latent curriculum that gradually rewrites explicit reasoning segments into latent blocks across training stages. The example in the bottom panel corresponds to an intermediate curriculum stage.
  • Figure 4: Per-task performance comparison on MMEB-v2.
  • Figure 5: Activation preferences of specialized experts across image and video retrieval sub-tasks.
  • ...and 1 more figure
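The semantic-anchor-guided transition adapter described in the Figure 3 caption can likewise be sketched. The routing rule below (a softmax over similarities to learned anchors, mixed with an always-active shared expert) is one plausible reading of the caption, not the paper's verified design; AnchorGuidedAdapter and all hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

class AnchorGuidedAdapter(nn.Module):
    """Routes each latent step through a shared expert plus anchor-gated specialized experts."""

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.shared = nn.Linear(dim, dim)   # always-active shared expert
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.anchors = nn.Parameter(torch.randn(num_experts, dim))  # learned semantic anchors

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, dim), the latent state at the current rollout step
        scores = torch.einsum("bd,ed->be", h, self.anchors)      # similarity to each anchor
        weights = scores.softmax(dim=-1)                         # soft routing over experts
        routed = torch.stack([expert(h) for expert in self.experts], dim=1)  # (batch, E, dim)
        specialized = (weights.unsqueeze(-1) * routed).sum(dim=1)
        return self.shared(h) + specialized   # combine shared and specialized paths
```

Because the expert set and anchors are fixed, the adapter can steer the rollout along different reasoning trajectories while keeping the computation budget constant, consistent with the abstract's "same fixed computation budget" claim.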