TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval

Xiangzhao Hao; Shijie Wang; Tianyu Yang; Tianyue Wang; Haiyun Guo; Jinqiao Wang

TL;DR

TRACE (Task-adaptive Reasoning And Compressing Embeddings) unifies generative reasoning with discriminative representation learning and exhibits remarkable zero-shot transferability to unseen domains and novel constraints.

Abstract

Universal Multimodal Retrieval requires unified embedding models capable of interpreting diverse user intents, ranging from simple keywords to complex compositional instructions. While Multimodal Large Language Models (MLLMs) possess strong reasoning capabilities, prevailing adaptations confine them to static encoders, underutilizing their generative potential. This encoder-only paradigm struggles with complex intents that demand logical deduction rather than superficial pattern matching. To address this, we introduce TRACE (Task-adaptive Reasoning And Compressing Embeddings). TRACE unifies generative reasoning with discriminative representation learning. It first generates a structured Chain-of-Thought (CoT) to explicitly reason about the query, and subsequently compresses this reasoning trace into a compact embedding via a dedicated token. To train this framework, we construct M-BEIR-CoT, a large-scale dataset featuring a difficulty-aware routing strategy. Experiments on the M-BEIR benchmark establish TRACE as the new state-of-the-art. Crucially, TRACE demonstrates a learned implicit routing behavior. It autonomously activates reasoning for complex queries while bypassing it for simpler ones, achieving an optimal balance between retrieval accuracy and inference throughput. Furthermore, by internalizing the deductive process, TRACE exhibits remarkable zero-shot transferability to unseen domains and novel constraints.
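The two inference paths described above (emit the embedding token directly, or emit a reasoning trace first and then the embedding token) can be told apart by inspecting the generated text. The following is a minimal sketch, not the authors' implementation. It assumes the `<|emb|>` token and `<reasoning>` tags named in the Figure 2 caption; the helper names `parse_output` and `path_taken` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Token and tag names follow the Figure 2 caption; everything else is illustrative.
EMB_TOKEN = "<|emb|>"
REASONING_RE = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)


def parse_output(generated: str) -> Tuple[Optional[str], bool]:
    """Return (reasoning trace or None, whether EMB_TOKEN was emitted)."""
    # Only text before the embedding token can contain the reasoning trace.
    head, sep, _ = generated.partition(EMB_TOKEN)
    has_emb = bool(sep)
    match = REASONING_RE.search(head)
    trace = match.group(1).strip() if match else None
    return trace, has_emb


def path_taken(generated: str) -> str:
    """Classify a generation as 'direct', 'reasoning', or 'invalid'."""
    trace, has_emb = parse_output(generated)
    if not has_emb:
        return "invalid"  # no embedding token, so no feature can be extracted
    return "reasoning" if trace else "direct"


if __name__ == "__main__":
    print(path_taken("<|emb|>"))
    print(path_taken("<reasoning>The edit asks for a red car.</reasoning><|emb|>"))
```

In this sketch an empty `<reasoning></reasoning>` pair counts as the direct path, which is a design choice of the example rather than something the abstract specifies.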

Paper Structure (23 sections, 8 equations, 6 figures, 12 tables)

Introduction
Related Work
Evolution of Universal Multimodal Retrieval
Chain-of-Thought for Discriminative Tasks
Method
Problem Formulation
Construction of M-BEIR-CoT
The TRACE Framework
Experiments
Experimental Setup
Efficiency and Adaptive Analysis
Performance on Universal Retrieval
Scalability Across Backbones
Zero-Shot Generalization to Unseen Scenarios
Ablation Study
...and 8 more sections

Figures (6)

Figure 1: The TRACE Framework. TRACE learns a query-dependent inference strategy. (a) For simple queries, it implicitly bypasses the reasoning stage and directly extracts features to maintain high efficiency. (b) For complex queries, it automatically activates the task-adaptive reasoning process. The model generates an explicit reasoning trace to resolve semantic ambiguities before compressing this context into the final representation. (c) Performance comparison on the M-BEIR benchmark (Wei et al., 2023) demonstrates the effectiveness of TRACE, particularly on reasoning-intensive tasks.
Figure 2: The construction pipeline of the M-BEIR-CoT dataset. The process operates in three phases: (1) Query Complexity Assessment: An advanced MLLM assesses query difficulty, routing simple queries to a direct path (generating only <|emb|>) and complex queries to a reasoning path (generating CoT + <|emb|>). (2) Task-Specific CoT Generation: We design specialized prompts for diverse tasks (e.g., captioning, text edit, VQA) to generate structured reasoning traces enclosed in <reasoning> tags. (3) Dual Filtering & Curation: To ensure data quality, we apply a coarse-to-fine strategy. We first use rule-based filtering to verify formats and lengths, followed by model-based filtering to ensure semantic consistency between the generated text and ground-truth targets.
Figure 3: Illustration of the TRACE architecture. The model processes a multimodal query through a frozen vision encoder and a trainable projector. The LLM acts as a unified reasoner and encoder. It first generates a Chain-of-Thought (CoT) to interpret the intent and then compresses the semantics into a learnable <|emb|> token. The final query feature is extracted from the hidden state immediately preceding <|emb|>. During training, the model is optimized jointly using Cross-Entropy (CE) loss for reasoning generation and InfoNCE loss (van den Oord et al., 2018) for embedding alignment.
Figure 4: Visualization of Adaptive Activation. (Left) In-Domain Retrieval: TRACE dynamically toggles between a direct path and a reasoning path based on query complexity. (Right) Zero-Shot Generalization: The adaptive behavior effectively transfers to unseen domains and novel constraints.
Figure 5: Examples of task-specific prompts used in M-BEIR-CoT.
...and 1 more figure
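The joint objective named in the Figure 3 caption (Cross-Entropy for reasoning generation plus InfoNCE for embedding alignment) can be illustrated with a small pure-Python sketch. This is not the paper's training code. The in-batch negatives, cosine similarity, the `temperature` value, and the `ce_weight` factor are all assumptions made for the example; the captions do not state how the two losses are weighted.

```python
import math
from typing import List, Sequence


def log_softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one row of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def cross_entropy(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean token-level CE; logits[t] scores the vocabulary at step t."""
    losses = [-log_softmax(row)[t] for row, t in zip(logits, targets)]
    return sum(losses) / len(losses)


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def info_nce(queries, candidates, temperature: float = 0.05) -> float:
    """InfoNCE with in-batch negatives (assumed): candidates[i] is the positive for queries[i]."""
    q = [normalize(v) for v in queries]
    c = [normalize(v) for v in candidates]
    losses = []
    for i, qi in enumerate(q):
        # Cosine similarity of query i against every candidate in the batch.
        sims = [sum(a * b for a, b in zip(qi, cj)) / temperature for cj in c]
        losses.append(-log_softmax(sims)[i])
    return sum(losses) / len(losses)


def joint_loss(logits, targets, queries, candidates, ce_weight: float = 1.0) -> float:
    """Weighted sum of the two terms; ce_weight is a hypothetical knob."""
    return ce_weight * cross_entropy(logits, targets) + info_nce(queries, candidates)


if __name__ == "__main__":
    logits = [[0.0, 0.0, 0.0, 0.0]]   # one reasoning token, uniform over 4 words
    targets = [2]
    queries = [[1.0, 0.0], [0.0, 1.0]]
    candidates = [[1.0, 0.0], [0.0, 1.0]]
    print(joint_loss(logits, targets, queries, candidates))
```

For queries on the direct path there are no reasoning tokens to supervise, so in a sketch like this the CE term would cover only the `<|emb|>` token itself, or be skipped.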
