Evidence-based diagnostic reasoning with multi-agent copilot for human pathology

Luca L. Weishaupt; Chengkuan Chen; Drew F. K. Williamson; Richard J. Chen; Tong Ding; Bowen Chen; Anurag Vaidya; Long Phi Le; Guillaume Jaume; Ming Y. Lu; Faisal Mahmood

Abstract

Pathology is undergoing rapid digital transformation driven by whole-slide imaging and artificial intelligence (AI). While deep learning-based computational pathology has achieved notable success, traditional models focus primarily on image analysis and do not integrate natural language instructions or rich, text-based context. Current multimodal large language models (MLLMs) in computational pathology face several limitations, including insufficient training data, inadequate support and evaluation for multi-image understanding, and a lack of autonomous diagnostic reasoning capabilities. To address these limitations, we introduce PathChat+, a new MLLM specifically designed for human pathology, trained on over 1 million diverse, pathology-specific instruction samples and nearly 5.5 million question-answer turns. Extensive evaluations across diverse pathology benchmarks demonstrate that PathChat+ substantially outperforms the prior PathChat copilot, as well as state-of-the-art (SOTA) general-purpose models and other pathology-specific models. Furthermore, we present SlideSeek, a reasoning-enabled multi-agent AI system that leverages PathChat+ to autonomously evaluate gigapixel whole-slide images (WSIs) through iterative, hierarchical diagnostic reasoning. SlideSeek reaches high accuracy on DDxBench, a challenging open-ended differential diagnosis benchmark, and can also generate visually grounded, human-interpretable summary reports.

Paper Structure

This paper contains 1 section, 6 figures, 14 tables, 1 algorithm.

Figures (6)

Figure 1: Overview of SlideSeek and PathChat+ for slide-level diagnosis and report generation. A. Our multi-agent AI system, SlideSeek, starts from a standardized task description and reaches a slide diagnosis autonomously. A reasoning large language model (LLM) serves as supervisor, continually tracking progress, refining diagnostic plans, and choosing additional regions for morphological examination. During each planning iteration, the supervisor instructs a team of specialized pathologist agents, each of which interacts with PathChat+ to analyze specific regions and report findings back to the supervisor. This iterative, hierarchical workflow continues until the supervisor agent determines that sufficient evidence has been collected to establish a well-supported differential diagnosis. A separate report agent synthesizes the morphological evidence from critical ROIs into an interpretable, visually grounded diagnostic summary report. See Extended Data \ref{fig:SlideSeekArch} for a step-by-step view of the supervisor–explorer workflow. B. Example of PathChat+ taking one or more regions of interest (ROIs) together with text instructions and providing a morphologically grounded tissue description. C. PathChat+ is trained with instruction finetuning on 1.13 million instructions and 5.49 million question-answer turns, based on 624 thousand unique images. The distribution of images is shown by tissue site, staining, and disease category.
Figure 2: ROI-level benchmarking of PathChat+ versus multimodal LLM baselines. A.-B. Visual question answering (VQA): A. example ROI with prompt and answer; B. accuracy on PathMMU subsets (Atlas, EduContent, PathCLS, SocialPath, PubMed) and PathQABench MCQ. C.-D. Image classification: C. example multiple-choice prompt and ground truth; D. accuracy on BRACS, UniToPatho, and HiCervix. E.-F. Captioning: E. example model-generated morphological description; F. METEOR score on PathQABench-Caption. Bars report, for each family, the strongest closed-source general model (“Best General (Closed)”), strongest open-source general model (“Best General (Open)”), and strongest medical-specialized model (“Best Specific (Open)”), alongside PathChat 1 and PathChat+ (legend). Full per-model results are in Extended Data \ref{tab:PathMMU_all, tab:PathMMU_Atlas, tab:PathMMU_EduContent, tab:PathMMU_PathCLS, tab:PathMMU_SocialPath, tab:PathMMU_PubMed, tab:becnchmark_PathQABench, tab:benchmark_classification, tab:becnchmark_PathQABenchCaption}. Error bars denote 95% confidence intervals from non-parametric bootstrapping. In B., D., and F., statistical significance between PathChat+ and each of the other models was assessed using a paired two-sided permutation test ($n=1000$). P-values are indicated as $p<0.05$: *, $p<0.01$: **, $p<0.001$: ***.
Figure 3: Performance of SlideSeek and PathChat+ on DDxBench for open-ended differential diagnosis from whole-slide images. Each prediction was manually assessed by a board-certified anatomic pathologist who compared the model's predictions against the assigned ground truth diagnosis and report. Model performance is measured using top-1 (Primary Diagnosis) and top-3 (Primary + Differential diagnoses) accuracy. A. Distribution of cases in DDxBench (N=150 cases) by the tissue they were sampled from and by disease rarity based on incidence (rare: $<6/10^5$ per year, common otherwise). B. SlideSeek performance stratified by disease rarity. C. Multimodal LLM performance on DDxBench with pre-selected ROIs. Each multimodal LLM was given 10 expert-curated ROIs from each slide and a prompt asking for a primary diagnosis and two additional differential diagnoses. Extended Data Figure \ref{fig:MLLM-baseline} illustrates this experiment and the exact prompt used. D.-E. Ablation study varying the agentic configuration of SlideSeek, in which the reasoning agent (GPT-5-mini) is replaced with a non-reasoning agent (GPT-4.1), and in which the agent supervisor is removed and replaced by a single agent. F.-G. Ablation study varying the captioning model used in SlideSeek, replacing PathChat+ with PathChat 1 or with a general-purpose multimodal LLM (GPT-5-mini). Error bars represent 95% confidence intervals from non-parametric bootstrapping. In B., statistical significance was obtained using an unpaired two-sided permutation test with pairwise comparisons across the three sets. In C., E., and G., statistical significance between our model (PathChat+ or SlideSeek) and each of the other models and ablations was assessed using a paired two-sided permutation test ($n=1000$). P-values are indicated as $p<0.05$: *, $p<0.01$: **, $p<0.001$: ***.
Figure 4: Example of SlideSeek on DDxBench. Diagnostic trace of a follicular thyroid carcinoma case illustrating the interaction between the supervisor and explorer agents. The supervisor receives a system prompt containing a low-resolution thumbnail, clinical information, and task instructions. Based on successive observations, the supervisor generates focused tasks (e.g., "Task 1", "Task 2") for explorer agents. Explorer agents then examine designated regions at specified magnifications and describe tissue morphology with the assistance of PathChat+. Their individual findings (e.g., "Report 1", "Report 2") are submitted back to the supervisor. In this example, although most tissue appears benign, explorer agents identify invasive carcinoma cells at high magnification. Ten regions of interest (ROIs) are evaluated with PathChat+ to support the primary and differential diagnoses, which the supervisor synthesizes into a final report.
Figure 5: Prompts, outputs, and tools used by SlideSeek. The supervisor receives a task prompt, slide thumbnail, and auto-generated slide description (dimensions and tissue bounding boxes), then iteratively updates hypotheses, a plan, the current step, explorer tasks, justifications, and a finished flag. Explorers receive the same slide context plus an assigned task, and navigate within the specified regions at the specified magnifications. At each step they either choose coordinates to obtain a PathChat+ morphological description of a region of interest (ROI) or submit a concise report to the supervisor, which ends their exploration. The supervisor-explorer loop continues until the supervisor deems that sufficient evidence has been collected, after which the supervisor selects up to 10 diagnostically relevant ROIs, requests a PathChat+ differential, and composes a visually grounded report with a primary diagnosis, two differential diagnoses, and a confidence assessment.
...and 1 more figure
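
The captions for Figures 2 and 3 report 95% confidence intervals from non-parametric bootstrapping and p-values from a paired two-sided permutation test ($n=1000$). The following is a minimal sketch of how such quantities could be computed from per-case correctness indicators (1 = correct, 0 = incorrect). It is not the authors' implementation: the function names, the percentile form of the bootstrap interval, the sign-flip formulation of the paired test, the add-one correction on the p-value, and the reading of $n=1000$ as the number of permutations are all assumptions made for illustration.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(correct: Sequence[int], n_boot: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for accuracy (assumed variant).

    Cases are resampled with replacement and the accuracy of each
    resample is recorded; the CI is read off the sorted accuracies.
    """
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def paired_permutation_test(a: Sequence[int], b: Sequence[int],
                            n_perm: int = 1000, seed: int = 0) -> float:
    """Two-sided paired permutation test on the accuracy difference.

    Swapping the two models' outcomes on a case is the same as flipping
    the sign of that case's difference, so each permutation flips every
    per-case difference independently with probability 0.5.
    """
    if len(a) != len(b):
        raise ValueError("paired test needs equal-length inputs")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            hits += 1
    # Add-one correction (an assumption) keeps the p-value above zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy, made-up outcomes for two hypothetical models on the same cases.
    model_a = [1] * 30
    model_b = [0] * 30
    print(bootstrap_ci(model_a))
    print(paired_permutation_test(model_a, model_b))
```

Both functions take a seed so that repeated runs give the same interval and p-value. The paired test requires that both models were scored on the same cases in the same order.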