MMSearch-R1: Incentivizing LMMs to Search

Jinming Wu; Zihao Deng; Wei Li; Yiding Liu; Bo You; Bo Li; Zejun Ma; Ziwei Liu

TL;DR

This work tackles the challenge of keeping LMMs up-to-date with external knowledge for knowledge-intensive VQA. It introduces MMSearch-R1, an end-to-end reinforcement learning framework that teaches LMMs to search on demand using both image and text tools via Group Relative Policy Optimization. The authors build a multimodal search pipeline, curate a balanced FVQA dataset, and demonstrate that MMSearch-R1 can outperform same-size RAG baselines and approach the performance of larger models while significantly reducing search calls. Key findings include improved on-demand search behavior, better query generation and summarization, greater reliance on internal knowledge, and superior data efficiency compared with supervised fine-tuning, supported by extensive ablations and analysis. The work also releases data and tooling to spur future research in multimodal search-driven reasoning.

Abstract

Robust deployment of large multimodal models (LMMs) in real-world scenarios requires access to external knowledge sources, given the complexity and dynamic nature of real-world information. Existing approaches such as retrieval-augmented generation (RAG) and prompt-engineered search agents rely on rigid pipelines, often leading to inefficient or excessive search behaviors. We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables LMMs to perform on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them, guided by an outcome-based reward with a search penalty. To support training, we collect a multimodal search VQA dataset through a semi-automated pipeline that covers diverse visual and textual knowledge needs, and we curate a search-balanced subset with both search-required and search-free samples, which proves essential for shaping efficient and on-demand search behavior. Extensive experiments on knowledge-intensive and info-seeking VQA tasks show that our model not only outperforms RAG-based baselines of the same model size, but also matches the performance of a larger RAG-based model while reducing search calls by over 30%. We further analyze key empirical findings to offer actionable insights for advancing research in multimodal search.
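
To make the idea of an outcome-based reward with a search penalty more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the reward formula, so the function `outcome_reward`, the penalty value `0.1`, and the rollout tuples are all assumptions chosen for illustration. The group-relative normalization (reward minus group mean, divided by group standard deviation) follows the general idea behind Group Relative Policy Optimization; the use of the population standard deviation here is also an assumption.

```python
from statistics import mean, pstdev


def outcome_reward(correct: bool, used_search: bool, search_penalty: float = 0.1) -> float:
    """Hypothetical reward: 1 for a correct answer, reduced if search was called.

    The penalty value is an illustrative assumption, not taken from the paper.
    """
    if not correct:
        return 0.0
    return 1.0 - search_penalty if used_search else 1.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one question: (answer correct?, search used?)
rollouts = [(True, False), (True, True), (False, True), (False, False)]
rewards = [outcome_reward(c, s) for c, s in rollouts]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, a correct answer produced without search receives a higher advantage than a correct answer that needed search, and both rank above incorrect answers. This is one way a penalty could push a model toward searching only when its internal knowledge is insufficient.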
