MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains

Xuying Ning, Dongqi Fu, Tianxin Wei, Mengting Ai, Jiaru Zou, Ting-Wei Li, Hanghang Tong, Yada Zhu, Hendrik Hamann, Jingrui He

TL;DR

MC-Search is presented as the first benchmark for agentic multimodal retrieval-augmented generation (MM-RAG) with long, step-wise annotated reasoning chains spanning five representative reasoning structures. Search-Align, a process-supervised fine-tuning framework that leverages verified reasoning chains, is also introduced, showing that the data not only enables faithful evaluation but also improves planning and retrieval fidelity in open-source MLLMs.

Abstract

With the increasing demand for step-wise, cross-modal, and knowledge-grounded reasoning, multimodal large language models (MLLMs) are evolving beyond the traditional fixed retrieve-then-generate paradigm toward more sophisticated agentic multimodal retrieval-augmented generation (MM-RAG). Existing benchmarks, however, mainly focus on simplified QA with short retrieval chains, leaving adaptive planning and multimodal reasoning underexplored. We present MC-Search, the first benchmark for agentic MM-RAG with long, step-wise annotated reasoning chains spanning five representative reasoning structures. Each example specifies sub-questions, retrieval modalities, supporting facts, and intermediate answers, with fidelity ensured by HAVE (Hop-wise Attribution and Verification of Evidence), resulting in 3,333 high-quality examples averaging 3.7 hops. Beyond answer accuracy, MC-Search introduces new process-level metrics for reasoning quality, stepwise retrieval and planning accuracy. By developing a unified agentic MM-RAG pipeline, we benchmark six leading MLLMs and reveal systematic issues such as over- and under-retrieval and modality-misaligned planning. Finally, we introduce Search-Align, a process-supervised fine-tuning framework leveraging verified reasoning chains, showing that our data not only enables faithful evaluation but also improves planning and retrieval fidelity in open-source MLLMs.
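
To make the annotation format concrete, the following is a minimal sketch of how one benchmark example could be represented, assuming only the four per-hop fields named in the abstract (sub-question, retrieval modality, supporting facts, intermediate answer). The class names, field names, and modality labels are hypothetical and are not taken from the MC-Search release.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hop:
    """One annotated reasoning step (field names are illustrative assumptions)."""
    sub_question: str
    modality: str                 # assumed labels such as "text" or "image"
    supporting_facts: List[str]
    intermediate_answer: str


@dataclass
class ChainExample:
    """A question with its step-wise annotated reasoning chain."""
    question: str
    hops: List[Hop] = field(default_factory=list)
    final_answer: str = ""


def average_hops(examples: List[ChainExample]) -> float:
    """Mean chain length over a set of examples; 0.0 for an empty set."""
    if not examples:
        return 0.0
    return sum(len(e.hops) for e in examples) / len(examples)
```

Applied to the full benchmark, a statistic of this kind would correspond to the reported average of 3.7 hops over 3,333 examples.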

Paper Structure

This paper contains 61 sections, 5 equations, 10 figures, and 14 tables.

Figures (10)

  • Figure 1: Example query requiring multimodal agentic search with a long reasoning chain.
  • Figure 2: Overview of MC-Search benchmark and evaluation. Left: Benchmark covering five reasoning topologies, filtered via the hop-wise attribution and verification of evidence (HAVE) process. Right: Multimodal agentic RAG pipeline, where an MLLM iteratively generates sub-queries and actions, retrieves multimodal evidence, reasons over the retrieved information, and integrates it to produce the final answer. Our framework further aligns predicted reasoning chains with golden trajectories to assess chain-level retrieval and planning.
  • Figure 3: Model F1 across samples with different chain lengths, showing a consistent drop in performance as the chain length increases.
  • Figure 4: Model $\Delta$F1 vs. $\Delta$Step (the difference in length between generated and golden reasoning chains), where larger positive $\Delta$Step values indicate more over-retrieval.
  • Figure 5: Eight-way error taxonomy proportions (higher is worse), annotated by Gemini-2.5-Pro.
  • ...and 5 more figures
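
The $\Delta$Step quantity in the Figure 4 caption can be sketched as a simple length difference. This is an illustration under stated assumptions rather than the paper's implementation: the function names are hypothetical, chains are modelled as plain lists of steps, and reading a negative value as under-retrieval is an assumption made by symmetry with the caption, which only defines the positive case.

```python
from typing import List, Sequence


def delta_step(generated_chain: Sequence, golden_chain: Sequence) -> int:
    """Length of the generated chain minus length of the golden chain."""
    return len(generated_chain) - len(golden_chain)


def retrieval_tendency(delta: int) -> str:
    """Label a Delta-Step value.

    Positive means over-retrieval (per the Figure 4 caption); treating
    negative values as under-retrieval is an assumption of this sketch.
    """
    if delta > 0:
        return "over-retrieval"
    if delta < 0:
        return "under-retrieval"
    return "matched"


# Hypothetical chains: the model issues five steps where the golden chain has three.
generated: List[str] = ["s1", "s2", "s3", "s4", "s5"]
golden: List[str] = ["g1", "g2", "g3"]
tendency = retrieval_tendency(delta_step(generated, golden))
```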