Dr.Mi-Bench: A Modular-integrated Benchmark for Scientific Deep Research Agent

Zhihan Guo, Feiyang Xu, Yifan Li, Muzhi Li, Shuai Zou, Jiele Wu, Han Shi, Haoli Bai, Ho-fung Leung, Irwin King

TL;DR

Dr.Mi-Bench introduces a modular-integrated benchmark and the Dr.Mi-Eval framework to evaluate scientific deep research (DR) agents on planning, retrieval, and reasoning across 10 domains and 200 instances. The framework offers an end-to-end mode and an isolated mode to diagnose module-level capabilities and backbone dependencies, revealing planning as a key bottleneck and showing that retrieval and reasoning difficulty varies by domain. Empirical results with state-of-the-art DR systems and diverse LLM backbones show specialization among models (e.g., Grok for retrieval, Gemini for reasoning) but widespread weaknesses in multi-source retrieval and cross-domain robustness, offering actionable diagnostics for improving academic research assistants. The work emphasizes a structured, annotation-driven evaluation to align automated evaluation with human preferences and to guide future, more reliable, domain-aware DR agents.

Abstract

The explosive growth in academic literature necessitates automated deep research (DR) agents, yet their evaluation remains a significant challenge. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, existing benchmarks favor general domains over the scientific domains that are the core application for DR agents. To address these gaps, we introduce Dr.Mi-Bench, a Modular-integrated benchmark for scientific DR agents. Grounded in academic literature, our benchmark uses a human-annotated dataset of 200 instances across 10 scientific domains, including both research and review papers. We also propose Dr.Mi-Eval, a novel Modular-integrated Evaluation Paradigm for DR agents, which leverages the rich structure of academic papers to assess the core competencies of planning, retrieval, and reasoning through two complementary modes: an end-to-end evaluation for DR agents and an isolated evaluation for foundational LLMs as potential backbones. Experimental results reveal a fragmented performance landscape: agents exhibit specialized strengths but share critical weaknesses, most notably in performing the multi-source retrieval required for review-style tasks and in performing consistently across diverse scientific fields. Moreover, improving high-level planning capability is the crucial factor for unlocking the reasoning potential of foundational LLMs as backbones. By exposing these actionable failure modes, Dr.Mi-Bench provides a diagnostic tool to guide the development of more reliable academic research assistants.

Paper Structure

This paper contains 75 sections, 14 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: The Dr.Mi-Bench and Dr.Mi-Eval Framework. This figure illustrates how a DR agent workflow (a) is assessed by our modular-integrated evaluation paradigm (b) using our human-annotated benchmark (c). This approach enables a diagnostic analysis that pinpoints specific failures in the agent's planning, retrieval, and reasoning capabilities.
  • Figure 2: End-to-end DR agents performance across diverse domains: CS (Computer Science), FIN (Finance), MED (Medicine), BIO (Biology), CHEM (Chemistry), ENV (Environmental Science), NRG (Energy), BC (Building and Construction), EARTH (Earth Science), and MAT (Materials). All results are presented as percentages.
  • Figure 3: Efficiency of retrieval and reasoning modules. All results are presented as percentages.
  • Figure 4: Performance scaling with test-time compute.
  • Figure 5: Foundational LLMs performance across diverse domains. The heatmap visualizes the accuracy scores of various models (rows) across 10 scientific and technical domains (columns), where color intensity corresponds to performance. The domains are: CS (Computer Science), FIN (Finance), MED (Medicine), BIO (Biology), CHEM (Chemistry), ENV (Environmental Science), NRG (Energy), BC (Building and Construction), EARTH (Earth Science), and MAT (Materials).

Theorems & Definitions (9)

  • Definition 1: Deep Research Agent
  • Definition 2: Modular-Integrated Evaluation Paradigm
  • Definition 3: Planner $\pi$
  • Definition 4: Retriever $\rho$
  • Definition 5: Reasoner $\sigma$
  • Definition 6: True Positives
  • Definition 7: True Negatives
  • Definition 8: False Positives
  • Definition 9: False Negatives
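
Definitions 6 to 9 name the four standard confusion-matrix outcomes. This listing does not show how the paper combines them, so the following is only a minimal sketch of the textbook usage, assuming binary predictions compared against binary gold labels. The function names `confusion_counts` and `precision_recall_f1` and the sample data are hypothetical illustrations, not taken from the paper.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(predicted: Sequence[bool], gold: Sequence[bool]) -> Dict[str, int]:
    """Count true/false positives and negatives for paired binary labels.

    Hypothetical helper: assumes one prediction per gold label.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for p, g in zip(predicted, gold):
        if p and g:
            counts["tp"] += 1      # predicted positive, actually positive
        elif not p and not g:
            counts["tn"] += 1      # predicted negative, actually negative
        elif p and not g:
            counts["fp"] += 1      # predicted positive, actually negative
        else:
            counts["fn"] += 1      # predicted negative, actually positive
    return counts


def precision_recall_f1(counts: Dict[str, int]) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1; returns 0.0 when a denominator is zero."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up labels purely for illustration.
    predicted = [True, True, False, False, True]
    gold = [True, False, False, True, True]
    counts = confusion_counts(predicted, gold)
    print(counts)
    print(precision_recall_f1(counts))
```

Precision and recall ignore true negatives, which is why the sketch only reads `tp`, `fp`, and `fn` when computing them; whether the paper reports these metrics or others built from the same four counts is not stated in this listing.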