MMSearch-R1: Incentivizing LMMs to Search

Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, Ziwei Liu

TL;DR

This work tackles the challenge of keeping LMMs up-to-date with external knowledge for knowledge-intensive VQA. It introduces MMSearch-R1, an end-to-end reinforcement learning framework that teaches LMMs to search on demand using both image and text tools via Group Relative Policy Optimization. The authors build a multimodal search pipeline, curate a balanced FVQA dataset, and demonstrate that MMSearch-R1 can outperform same-size RAG baselines and approach the performance of larger models while significantly reducing search calls. Key findings include improved on-demand search behavior, better query generation and summarization, greater reliance on internal knowledge, and superior data efficiency compared with supervised fine-tuning, supported by extensive ablations and analysis. The work also releases data and tooling to spur future research in multimodal search-driven reasoning.
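
As a rough illustration of the group-relative idea behind Group Relative Policy Optimization, the sketch below normalizes the rewards of several rollouts sampled for the same question against the group's own mean and standard deviation. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for this example; it is not the authors' training code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group.

    `rewards` holds one scalar reward per rollout sampled for the
    same prompt. `eps` (an assumed constant) avoids division by zero
    when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one question: two answered correctly, two did not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that score above the group average get a positive advantage and those below get a negative one, so no separate value model is needed to supply a baseline.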

Abstract

Robust deployment of large multimodal models (LMMs) in real-world scenarios requires access to external knowledge sources, given the complexity and dynamic nature of real-world information. Existing approaches such as retrieval-augmented generation (RAG) and prompt-engineered search agents rely on rigid pipelines, often leading to inefficient or excessive search behaviors. We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables LMMs to perform on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them, guided by an outcome-based reward with a search penalty. To support training, we collect a multimodal search VQA dataset through a semi-automated pipeline that covers diverse visual and textual knowledge needs, and we curate a search-balanced subset with both search-required and search-free samples, which proves essential for shaping efficient, on-demand search behavior. Extensive experiments on knowledge-intensive and info-seeking VQA tasks show that our model not only outperforms RAG-based baselines of the same model size, but also matches the performance of a larger RAG-based model while reducing search calls by over 30%. We further analyze key empirical findings to offer actionable insights for advancing research in multimodal search.
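
To make the phrase "outcome-based reward with a search penalty" concrete, here is a minimal sketch of one way such a reward could be shaped. The function name, the multiplicative form, and the penalty value of 0.1 are all hypothetical choices for illustration; the abstract does not specify the exact formula, so this should not be read as the paper's definition.

```python
def outcome_reward(is_correct: bool, used_search: bool,
                   search_penalty: float = 0.1) -> float:
    """Score a finished rollout by its final answer only.

    A correct answer earns 1.0; if the model called a search tool to
    get there, the score is scaled down by `search_penalty` (an assumed
    value). Incorrect answers earn 0.0 whether or not search was used.
    """
    score = 1.0 if is_correct else 0.0
    if used_search:
        score *= 1.0 - search_penalty
    return score

# Answering correctly from internal knowledge beats answering
# correctly after a search, which still beats a wrong answer.
no_search = outcome_reward(is_correct=True, used_search=False)
with_search = outcome_reward(is_correct=True, used_search=True)
wrong = outcome_reward(is_correct=False, used_search=True)
```

Under this shaping, searching is worthwhile only when it turns a wrong answer into a right one, which is the on-demand behavior the abstract describes.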

Paper Structure

This paper contains 45 sections, 2 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Overview of MMSearch-R1. MMSearch-R1 learns to recognize the boundaries of its knowledge and perform on-demand search, significantly reducing the number of searches required while outperforming RAG-based models on knowledge-intensive and info-seeking VQA tasks.
  • Figure 2: Illustration of training in MMSearch-R1. Top: The GRPO training pipeline integrated with multimodal search tools. Bottom: A detailed view of the rollout process and search tool execution.
  • Figure 3: Illustration of the FVQA data construction process: (a) an automated pipeline for collecting VQA samples that require visual knowledge; (b) the knowledge taxonomy; (c) the overall pipeline, showing the composition and origin of FVQA from various automatic and manually curated sources.
  • Figure 4: (a) Performance comparison between the Base model and the RL-trained model under the RAG workflow. (b) Answer behavior breakdown of the Base (inner circle) and RL (outer circle) models on InfoSeek and SimpleVQA.
  • Figure 5: (a) Performance improvements of SFT and RL over Base across five VQA datasets. (b) Training dynamics of reward and search ratio for different strategies.
  • ...and 5 more figures