MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval

Siyue Zhang, Yuan Gao, Xiao Zhou, Yilun Zhao, Tingyu Song, Arman Cohan, Anh Tuan Luu, Chen Zhao

TL;DR

MRMR is a realistic, expert-level multidisciplinary benchmark for reasoning-intensive multimodal retrieval, with 1,502 queries across 23 domains, interleaved image–text content, and expert-verified positives. It defines three tasks: Knowledge, Theorem, and Contradiction. The Contradiction task probes logical reasoning and includes specialized subtasks (Negation, Vehicle Design, Traffic Case). Across 14 frontier models and four retrieval paradigms, text retrieval over captioned images often surpasses multimodal methods on knowledge and reasoning tasks, while multimodal models show large cross-domain gaps and limited reasoning ability. The study demonstrates the value of LLM-assisted data construction and test-time reasoning expansion, and offers practical guidance for advancing multimodal retrieval in realistic, high-stakes domains.

Abstract

We introduce MRMR, the first expert-level multidisciplinary multimodal retrieval benchmark requiring intensive reasoning. MRMR contains 1,502 queries spanning 23 domains, with positive documents carefully verified by human experts. Compared to prior benchmarks, MRMR introduces three key advancements. First, it challenges retrieval systems across diverse areas of expertise, enabling fine-grained model comparison across domains. Second, queries are reasoning-intensive, with images requiring deeper interpretation such as diagnosing microscopic slides. We further introduce Contradiction Retrieval, a novel task requiring models to identify conflicting concepts. Finally, queries and documents are constructed as image-text interleaved sequences. Unlike earlier benchmarks restricted to single images or unimodal documents, MRMR offers a realistic setting with multi-image queries and mixed-modality corpus documents. We conduct an extensive evaluation of 4 categories of multimodal retrieval systems and 14 frontier models on MRMR. The text embedding model Qwen3-Embedding with LLM-generated image captions achieves the highest performance, highlighting substantial room for improving multimodal retrieval models. Although the latest multimodal models such as Ops-MM-Embedding perform competitively on expert-domain queries, they fall short on reasoning-intensive tasks. We believe that MRMR paves the way for advancing multimodal retrieval in more realistic and challenging scenarios.
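
The caption-based text retrieval setting mentioned in the abstract can be illustrated with a minimal sketch. It assumes each image in an interleaved sequence already has a caption (in the paper these are LLM-generated; here they are hand-written toy strings), replaces each image with its caption, and ranks documents by cosine similarity. The `toy_embed` function is a hypothetical bag-of-words stand-in for a real text embedding model such as Qwen3-Embedding, and all class names, function names, and example data are invented for illustration rather than taken from the authors' code.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Text:
    content: str


@dataclass
class Image:
    path: str
    caption: str  # assumption: produced offline by a captioning model


Segment = Union[Text, Image]


def flatten(sequence: List[Segment]) -> str:
    """Turn an interleaved image-text sequence into plain text
    by substituting each image with its caption."""
    parts = [seg.content if isinstance(seg, Text) else seg.caption
             for seg in sequence]
    return " ".join(parts)


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a text embedding model:
    a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query: List[Segment], corpus: Dict[str, List[Segment]]) -> List[str]:
    """Return document ids sorted from most to least similar to the query."""
    q_vec = toy_embed(flatten(query))
    scores = {doc_id: cosine(q_vec, toy_embed(flatten(doc)))
              for doc_id, doc in corpus.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data, invented for this sketch only.
query = [
    Text("Which disease does this slide show?"),
    Image("slide.png", "microscopic slide with granuloma and giant cells"),
]
corpus = {
    "doc_a": [
        Text("Granuloma formation with giant cells is described here."),
        Image("a.png", "granuloma under microscope"),
    ],
    "doc_b": [Text("Traffic rules for overtaking on highways.")],
}
ranking = rank(query, corpus)
print(ranking)
```

Swapping `toy_embed` for a dense embedding model would leave the rest of the pipeline unchanged, which is one reason the captioning approach is simple to pair with strong text retrievers.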

Paper Structure

This paper contains 51 sections, 18 figures, 8 tables.

Figures (18)

  • Figure 2: An overview of the data construction workflow for MRMR (Knowledge). We select and convert knowledge- and reasoning-intensive questions from MMMU-Pro [mmmupro] into retrieval queries. Web pages such as Wikipedia, blogs, and papers referenced by the GPT-Search model during reasoning are processed into documents through screen capturing, OCR [monkeyocr], and chunking. The relevance of resulting documents is first evaluated by GPT and then verified by expert annotators. Lastly, we source negative documents from the knowledge-intensive multimodal collection PIN-14M [pin] to construct a sizable corpus.
  • Figure 3: Annotation Interface - Step 1: Question Understanding. Annotators are first shown the question, associated images, candidate options, the correct answer, and an AI-generated explanation. The explanation is provided to aid understanding, though annotators are informed it may be incorrect. In this step, they judge whether the given answer is correct based on their own knowledge.
  • Figure 4: Annotation Interface - Step 2: Candidate Document Evaluation. After understanding the question, annotators review candidate documents one at a time and judge whether each can help answer the question correctly. Documents are shown in image format, with up to eight candidates presented. The definition of document relevance was explained to annotators before annotation began.
  • Figure 5: Annotation Interface - Step 3: Create Relevant Document. If none of the candidate documents are deemed relevant, annotators are required to search for a suitable web page and provide the gold evidence content. They are encouraged to include images from the source, and the final document is written in an interleaved image–text format.
  • Figure 6: GPT-5 prompt for cleaning the theorem content.
  • ...and 13 more figures