Dr.Mi-Bench: A Modular-integrated Benchmark for Scientific Deep Research Agent

Zhihan Guo; Feiyang Xu; Yifan Li; Muzhi Li; Shuai Zou; Jiele Wu; Han Shi; Haoli Bai; Ho-fung Leung; Irwin King

TL;DR

Dr.Mi-Bench introduces a modular-integrated benchmark and the Dr.Mi-Eval framework for evaluating scientific deep research (DR) agents on planning, retrieval, and reasoning, using 200 instances across 10 scientific domains. The framework offers end-to-end and isolated evaluation modes to diagnose module-level capabilities and backbone dependencies, revealing planning as a key bottleneck and showing that retrieval and reasoning difficulty varies by domain. Empirical results with state-of-the-art DR systems and diverse LLM backbones show specialization among models (e.g., Grok for retrieval, Gemini for reasoning) but widespread weaknesses in multi-source retrieval and cross-domain robustness, offering actionable diagnostics for improving academic research assistants. The work emphasizes a structured, annotation-driven evaluation that aligns automated assessment with human preferences and guides the development of more reliable, domain-aware DR agents.

Abstract

The explosive growth in academic literature necessitates automated deep research (DR) agents, yet their evaluation remains a significant challenge. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, existing benchmarks favor general domains over the scientific domains that are the core application for DR agents. To address these gaps, we introduce Dr.Mi-Bench, a Modular-integrated benchmark for scientific DR agents. Grounded in academic literature, our benchmark uses a human-annotated dataset of 200 instances across 10 scientific domains, including both research and review papers. We also propose Dr.Mi-Eval, a novel Modular-integrated Evaluation Paradigm for DR Agents, which leverages the rich structure of academic papers to assess the core competencies of planning, retrieval, and reasoning through two complementary modes: an end-to-end evaluation for DR agents and an isolated evaluation for foundational LLMs as potential backbones. Experimental results reveal a fragmented performance landscape: agents exhibit specialized strengths but share critical weaknesses, most notably in performing the multi-source retrieval required for review-style tasks and in performing consistently across diverse scientific fields. Moreover, improving high-level planning capability is the crucial factor for unlocking the reasoning potential of foundational LLMs as backbones. By exposing these actionable failure modes, Dr.Mi-Bench provides a diagnostic tool to guide the development of more reliable academic research assistants.
