Towards a Medical AI Scientist

Hongtao Wu, Boyun Zheng, Dingjie Song, Yu Jiang, Jianfeng Gao, Lei Xing, Lichao Sun, Yixuan Yuan

Abstract

Autonomous systems that generate scientific hypotheses, conduct experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing AI Scientists remain largely domain-agnostic, limiting their applicability to clinical medicine, where research must be grounded in medical evidence and specialized data modalities. In this work, we introduce Medical AI Scientist, the first autonomous research framework tailored to clinical medicine. It enables clinically grounded ideation by transforming extensively surveyed literature into actionable evidence through a clinician–engineer co-reasoning mechanism, which improves the traceability of generated research ideas. It further facilitates evidence-grounded manuscript drafting guided by structured medical compositional conventions and ethical policies. The framework operates under three research modes, namely paper-based reproduction, literature-inspired innovation, and task-driven exploration, each corresponding to a distinct level of automated scientific inquiry with progressively increasing autonomy. Comprehensive evaluations by both large language models and human experts demonstrate that the ideas generated by the Medical AI Scientist are of substantially higher quality than those produced by commercial LLMs across 171 cases, 19 clinical tasks, and 6 data modalities. Meanwhile, our system achieves strong alignment between the proposed method and its implementation, while also demonstrating significantly higher success rates in executable experiments. Double-blind evaluations by human experts and the Stanford Agentic Reviewer suggest that the generated manuscripts approach MICCAI-level quality while consistently surpassing those from ISBI and BIBM. The proposed Medical AI Scientist highlights the potential of leveraging AI for autonomous scientific discovery in healthcare.

Paper Structure

This paper contains 26 sections and 8 figures.

Figures (8)

  • Figure 1: a, System workflow: a fully automated multi-agent system for end-to-end scientific discovery in clinical medicine. b, Med-AI Bench: visualization depicting 19 distinct medical research tasks within performance benchmarking. c, Experimental setup: comparative evaluation across idea generation, execution, and full-paper compilation in the research lifecycle. d, Performance benchmarking: manuscript quality comparable to representative works from leading venues.
  • Figure 2: Medical AI Scientist surpasses commercial LLMs in idea generation under combined LLM-based and blinded human evaluation. Models generated research ideas that were anonymized and assessed by three independent experts using a five-point scale. a, LLM-based evaluation of idea quality. b, Quantitative human assessment across six evaluation criteria. c, Qualitative human analysis of strengths and limitations relative to commercial LLMs.
  • Figure 3: Example of idea generation comparison between Medical AI Scientist and commercial LLMs in the Literature-inspired Innovation Mode.
  • Figure 4: Comparative evaluation of the Medical AI Scientist framework against commercial LLMs in terms of implementation completeness and experimental success rate. a, Implementation completeness was assessed on a five-point scale (1–5). Model-generated outputs were anonymized and independently evaluated by two LLM-based judges. b, Experimental success rate measured through quantitative human evaluation.
  • Figure 5: Anonymized comparison of paper quality on an identical medical task. Manuscripts generated by Medical AI Scientist achieve quality comparable to papers from MICCAI, ISBI, and BIBM under consistent double-blind evaluation across both quantitative and qualitative assessments: a, Stanford Agentic Reviewer automatic evaluation. b, Double-blind scoring (1–5) by 10 medical experts (PhD/postdoc) across five review dimensions. c, Experts' observations on strengths and limitations.
  • ...and 3 more figures