MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline

Huanjin Yao; Qixiang Yin; Min Yang; Ziwang Zhao; Yibo Wang; Haotian Luo; Jingyi Zhang; Jiaxing Huang

TL;DR

This work proposes Hyper-Search, a hypergraph-based QA generation method that models and connects visual and textual nodes within and across modalities to generate search-intensive multimodal QA pairs whose solution requires invoking various search tools. It also introduces DR-TTS, which decomposes search-involved tasks into several categories according to search tool type.

Abstract

We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis, enabling it to conduct deep research tasks. However, we observe three main challenges in developing such agents: (1) the scarcity of search-intensive multimodal QA data, (2) the lack of effective search trajectories, and (3) the prohibitive cost of training with online search APIs. To tackle them, we first propose Hyper-Search, a hypergraph-based QA generation method that models and connects visual and textual nodes within and across modalities, enabling the generation of search-intensive multimodal QA pairs whose solution requires invoking various search tools. Second, we introduce DR-TTS (Decompose–Recompose Tool Tree Search), which first decomposes search-involved tasks into several categories according to search tool type and optimizes a specialized search tool expert for each tool. It then recomposes the tool experts to jointly explore search trajectories via tree search, producing trajectories that successfully solve complex tasks using various search tools. Third, we build an offline search engine supporting multiple search tools, enabling agentic reinforcement learning without costly online search APIs. With these three designs, we develop MM-DeepResearch, a powerful multimodal deep research agent, and extensive results show its superiority across benchmarks. Code is available at https://github.com/HJYao00/MM-DeepResearch
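To make the hypergraph idea concrete, the following is a minimal sketch of a structure holding visual and textual nodes joined by hyperedges within and across modalities. It is an illustration only, not the authors' implementation: the class names `Node` and `Hypergraph`, the methods `is_cross_modal` and `neighbors`, and the example contents are all hypothetical and do not come from the paper or its code release.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class Node:
    """A hypothetical node: one image or one text snippet."""
    node_id: str
    modality: str  # assumed to be "image" or "text"
    content: str   # placeholder payload (file name or text)


@dataclass
class Hypergraph:
    """Hypothetical container; a hyperedge may link any number of nodes."""
    nodes: Dict[str, Node] = field(default_factory=dict)
    hyperedges: Dict[str, Set[str]] = field(default_factory=dict)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_hyperedge(self, edge_id: str, node_ids: List[str]) -> None:
        # Reject edges that mention nodes that were never added.
        missing = [n for n in node_ids if n not in self.nodes]
        if missing:
            raise KeyError(f"unknown nodes: {missing}")
        self.hyperedges[edge_id] = set(node_ids)

    def is_cross_modal(self, edge_id: str) -> bool:
        # An edge is cross-modal if its members span more than one modality.
        modalities = {self.nodes[n].modality for n in self.hyperedges[edge_id]}
        return len(modalities) > 1

    def neighbors(self, node_id: str) -> Set[str]:
        # All nodes sharing at least one hyperedge with node_id.
        out: Set[str] = set()
        for members in self.hyperedges.values():
            if node_id in members:
                out |= members
        out.discard(node_id)
        return out


def build_example() -> Hypergraph:
    """Build a tiny made-up graph: one image and two text snippets."""
    g = Hypergraph()
    g.add_node(Node("img_1", "image", "photo_of_landmark.jpg"))
    g.add_node(Node("txt_1", "text", "article about the landmark"))
    g.add_node(Node("txt_2", "text", "biography of the architect"))
    g.add_hyperedge("e_cross", ["img_1", "txt_1"])  # across modalities
    g.add_hyperedge("e_text", ["txt_1", "txt_2"])   # within the text modality
    return g
```

In this toy graph, a question seeded at `img_1` whose answer lies in `txt_2` has to traverse a cross-modal edge and then a text-only edge. That is one plausible reading of why QA pairs built this way would call for more than one kind of search tool.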

Paper Structure (34 sections, 3 equations, 11 figures, 8 tables)

Introduction
Related Work
Agentic MLLMs
Workflow-based Search Agents
Multimodal Deep Research Agents
MM-DeepResearch Data
Hyper-Search for QA Data Generation
Search Hypergraph Construction
Multimodal Search QA Generation
Search QA Filtering
DR-TTS for Search Trajectory Synthesis
MM-DeepResearch Agents
SFT with Multi-turn Search Tool Invocation
Reinforcement Learning with Offline Search Engine
Evaluation Protocol
...and 19 more sections

Figures (11)

Figure 1: Overall performance of MM-DeepResearch-8B compared to other models across four benchmarks.
Figure 2: Overview of Hyper-Search for generating search-intensive QA data via hypergraph construction, QA generation, and filtering.
Figure 3: Overview of Decompose–Recompose Tool Tree Search for exploring search trajectories.
Figure 4: Data examples from the history category.
Figure 5: Data examples from the art category.
...and 6 more figures
