MedBrowseComp: Benchmarking Medical Deep Research and Computer Use

Shan Chen; Pedro Moreira; Yuxin Xiao; Sam Schmidgall; Jeremy Warner; Hugo Aerts; Thomas Hartvigsen; Jack Gallifant; Danielle S. Bitterman

TL;DR

MedBrowseComp introduces a medical web-browsing benchmark to evaluate AI agents on real-time, multi-hop information retrieval across heterogeneous sources. It provides a dataset of over 1,000 clinically meaningful questions curated from HemOnc.org and requires agents to locate verifiable medical facts from live sources. Evaluations reveal substantial capability gaps: frontier models struggle with multi-hop medical queries, often achieving well below 50% accuracy, with dramatically lower performance on the hardest tasks. The work establishes a rigorous testbed and baseline results for both deep-research and GUI-based computer-use agents, and outlines concrete directions for improving planning, tool use, grounding, and deployment in real-world medical settings.

Abstract

Large language models (LLMs) are increasingly envisioned as decision-support tools in clinical practice, yet safe clinical reasoning demands integrating heterogeneous knowledge bases -- trials, primary studies, regulatory documents, and cost data -- under strict accuracy constraints. Existing evaluations often rely on synthetic prompts, reduce the task to single-hop factoid queries, or conflate reasoning with open-ended generation, leaving their real-world utility unclear. To close this gap, we present MedBrowseComp, the first benchmark that systematically tests an agent's ability to reliably retrieve and synthesize multi-hop medical facts from live, domain-specific knowledge bases. MedBrowseComp contains more than 1,000 human-curated questions that mirror clinical scenarios where practitioners must reconcile fragmented or conflicting information to reach an up-to-date conclusion. Applying MedBrowseComp to frontier agentic systems reveals performance shortfalls as low as ten percent, exposing a critical gap between current LLM capabilities and the rigor demanded in clinical settings. MedBrowseComp therefore offers a clear testbed for reliable medical information seeking and sets concrete goals for future model and toolchain upgrades. You can visit our project page at: https://moreirap12.github.io/mbc-browse-app/
