V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

Dongyang Chen, Chaoyang Wang, Dezhao SU, Xi Xiao, Zeyu Zhang, Jing Xiong, Qing Li, Yuzhang Shang, Shichao Ka

TL;DR

V-Retrver tackles the limitation of language-only reasoning in universal multimodal retrieval by enabling an agentic reasoning loop that actively inspects visual evidence via external tools. It introduces Multimodal Interleaved Evidence Reasoning (MIER) and a three-stage curriculum-based training with an Evidence-Aligned Policy Optimization (EAPO) objective, including a composite reward $R_i = \alpha\, r_{\text{format}}(o_i) + \beta\, r_{\text{rank}}(o_i) + r_{\text{tool}}(o_i)$. Empirically, it achieves state-of-the-art performance on the M-BEIR benchmark (average Recall $R@5 = 69.7$) and demonstrates strong generalization on unseen datasets, validating the effectiveness of grounded, interactive visual reasoning for retrieval. These results underscore the potential of agentic MLLMs to perform more precise, evidence-based multimodal reasoning, with practical impact on large-scale retrieval systems and downstream RAG-style tasks.
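To make the composite reward concrete, the following minimal sketch computes $R_i$ for one sampled output $o_i$ from its three component scores. The function name, the default weights `alpha` and `beta`, and the example scores are illustrative assumptions; this summary does not state the weight values or how each component is computed.

```python
def composite_reward(r_format: float, r_rank: float, r_tool: float,
                     alpha: float = 0.5, beta: float = 1.0) -> float:
    """Combine the three reward terms as R_i = alpha*r_format + beta*r_rank + r_tool.

    alpha and beta are placeholder weights (assumed, not taken from the paper).
    """
    return alpha * r_format + beta * r_rank + r_tool


# Hypothetical component scores for a single rollout o_i.
reward = composite_reward(r_format=1.0, r_rank=0.5, r_tool=0.2)
print(reward)
```

Note that the tool term enters unweighted in the formula, so only the format and ranking terms are scaled.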

Abstract

Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (a 23.0% improvement on average), perception-driven reasoning reliability, and generalization.
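The alternation between hypothesis generation and targeted visual verification can be pictured as a simple control loop. The sketch below is a toy illustration under stated assumptions, not the authors' implementation: `interleaved_rerank`, the `propose` and `inspect` callables, the action dictionary format, and the `max_tool_calls` budget are all hypothetical names introduced here, and the stubs stand in for an MLLM policy and a visual tool.

```python
from typing import Callable, Dict, List, Tuple

Trace = List[Tuple[str, str]]   # (step kind, content) pairs
Action = Dict[str, object]      # hypothetical action format


def interleaved_rerank(
    candidates: List[str],
    propose: Callable[[Trace, bool], Action],
    inspect: Callable[[str], str],
    max_tool_calls: int = 3,
) -> Tuple[List[str], Trace]:
    """Alternate between policy hypotheses and tool-based evidence until a ranking is emitted."""
    trace: Trace = []
    tool_calls = 0
    while True:
        # Once the tool budget is spent, the policy is asked to commit to a ranking.
        must_rank = tool_calls >= max_tool_calls
        action = propose(trace, must_rank)
        if action["action"] == "rank" or must_rank:
            # Fall back to the input order if the policy gave no ranking.
            ranking = action.get("ranking", candidates)
            return list(ranking), trace
        target = str(action["target"])
        trace.append(("hypothesis", f"inspect {target}"))
        trace.append(("evidence", inspect(target)))  # visual tool call
        tool_calls += 1


# Toy stubs standing in for the MLLM policy and a visual tool (assumptions).
def toy_propose(trace: Trace, must_rank: bool) -> Action:
    if must_rank or any(kind == "evidence" for kind, _ in trace):
        return {"action": "rank", "ranking": ["img_b", "img_a"]}
    return {"action": "inspect", "target": "img_b"}


def toy_inspect(candidate: str) -> str:
    return f"zoomed view of {candidate}"


ranking, trace = interleaved_rerank(["img_a", "img_b"], toy_propose, toy_inspect)
print(ranking, trace)
```

Capping the number of tool calls is a design choice of this sketch that keeps the loop bounded; the source does not say whether or how the actual system limits tool use.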

Paper Structure (28 sections, 10 equations, 11 figures, 9 tables, 2 algorithms)

Figures (11)

  • Figure 1: Comparison between text-based CoT (left) and multimodal interleaved CoT (right) for multimodal retrieval. Text-based CoT relies on language-driven inference over static visual representations, often failing to resolve fine-grained differences. In contrast, V-Retrver performs multimodal interleaved CoT reasoning by invoking visual tools to inspect candidate images, enabling grounded reasoning and more reliable ranking decisions.
  • Figure 2: Overview of the V-Retrver framework. The left panel illustrates the inference pipeline, a coarse-to-fine process that combines embedding-based retrieval with agentic reranking. The right panel details the three proposed training stages: Cold Start, Rejection Sampling Fine-Tuning, and EAPO.
  • Figure 3: RL Training curves.
  • Figure 4: System Prompt template for training and inference.
  • Figure 5: User Prompt template for training and inference.
  • ...and 6 more figures