Towards Next-Generation Medical Agent: How o1 is Reshaping Decision-Making in Medical Scenarios

Shaochen Xu, Yifan Zhou, Zhengliang Liu, Zihao Wu, Tianyang Zhong, Huaqin Zhao, Yiwei Li, Hanqi Jiang, Yi Pan, Junhao Chen, Jin Lu, Wei Zhang, Tuo Zhang, Lu Zhang, Dajiang Zhu, Xiang Li, Wei Liu, Quanzheng Li, Andrea Sikora, Xiaoming Zhai, Zhen Xiang, Tianming Liu

TL;DR

This paper investigates replacing GPT-4 with the o1 backbone in three medical LLM-agent frameworks (CoD Agent, MedAgents, and AgentClinic) to enhance multi-step medical reasoning, tool use, and real-time information retrieval. Across DxBench, Dxy, Muzhi, MedQA, MedMCQA, and NEJM datasets, o1 improves diagnostic accuracy and consistency, though at the cost of higher runtimes and occasional limitations on simple tasks or multimodal data. The results support the potential of o1-driven medical agents to approach human-like diagnostic reasoning in dynamic clinical environments and motivate future multimodal, efficiency-optimized multi-agent systems. The findings underscore the importance of backbone selection for medical agents and suggest a path toward smarter, more responsive AI-assisted clinical decision-making with real-world impact.

Abstract

Artificial Intelligence (AI) has become essential in modern healthcare, with large language models (LLMs) offering promising advances in clinical decision-making. Traditional model-based approaches, including those leveraging in-context demonstrations and those with specialized medical fine-tuning, have demonstrated strong performance in medical language processing but struggle with real-time adaptability, multi-step reasoning, and handling complex medical tasks. Agent-based AI systems address these limitations by incorporating reasoning traces, tool selection based on context, knowledge retrieval, and both short- and long-term memory. These additional features enable the medical AI agent to handle complex medical scenarios where decision-making should be built on real-time interaction with the environment. Therefore, unlike conventional model-based approaches that treat medical queries as isolated questions, medical AI agents approach them as complex tasks and behave more like human doctors. In this paper, we study the choice of the backbone LLM for medical AI agents, which is the foundation for the agent's overall reasoning and action generation. In particular, we consider the emergent o1 model and examine its impact on agents' reasoning, tool-use adaptability, and real-time information retrieval across diverse clinical scenarios, including high-stakes settings such as intensive care units (ICUs). Our findings demonstrate o1's ability to enhance diagnostic accuracy and consistency, paving the way for smarter, more responsive AI tools that support better patient outcomes and decision-making efficacy in clinical practice.

Paper Structure

This paper contains 23 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of our modified CoDAgent: given the input symptoms and the candidate diseases, the LLM backbone model produces a disease diagnosis.
  • Figure 2: Overview of the MedAgents pipeline with five stages: 1) expert recruitment, which assigns a group of different experts related to the clinical question; 2) analysis proposition, where each expert creates an analysis of the clinical question based on its assigned role; 3) report summarization, where the analyses by all recruited experts are aggregated into one report; 4) collaborative consultation, which facilitates expert review and iterative modifications until consensus is reached on a final report; 5) final decision, where a clinical decision is made based on the summary report.
  • Figure 3: Overview of the AgentClinic pipeline, showing the interaction between four core agents (Measurement, Doctor, Patient, and Moderator), each supported by an LLM backbone. The doctor agent engages with the patient agent to gather symptoms and requests additional data (e.g., X-rays) from the measurement agent to aid in diagnosis. Once a diagnosis is reached, the moderator agent verifies it against standard care practices, ensuring clinical accuracy and compliance.
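
The five MedAgents stages named in the Figure 2 caption can be sketched as a small control loop in which the backbone LLM is a swappable function. This is only an illustrative sketch, not the authors' or the MedAgents implementation: the `Backbone` type, the prompt prefixes (`RECRUIT`, `ANALYZE`, and so on), the "agree" consensus check, the `max_rounds` cap, and `stub_backbone` are all assumptions made here for illustration. In practice `backbone` would wrap a call to a model such as GPT-4 or o1.

```python
from typing import Callable

# Hypothetical interface: a backbone is any function from prompt text to reply text.
Backbone = Callable[[str], str]


def run_medagents_style(question: str, backbone: Backbone, max_rounds: int = 3) -> str:
    """Run a five-stage, MedAgents-style pipeline on one clinical question."""
    # 1) Expert recruitment: ask the backbone for a comma-separated list of roles.
    reply = backbone(f"RECRUIT: list experts for: {question}")
    roles = [r.strip() for r in reply.split(",") if r.strip()]

    # 2) Analysis proposition: each recruited expert analyzes the question.
    analyses = [backbone(f"ANALYZE as {role}: {question}") for role in roles]

    # 3) Report summarization: aggregate all analyses into one report.
    report = backbone("SUMMARIZE: " + " | ".join(analyses))

    # 4) Collaborative consultation: revise until every expert agrees
    #    (or the assumed round limit is reached).
    for _ in range(max_rounds):
        votes = [backbone(f"REVIEW as {role}: {report}") for role in roles]
        objections = [v for v in votes if not v.strip().lower().startswith("agree")]
        if not objections:
            break
        report = backbone(f"REVISE: {report} || feedback: {' | '.join(objections)}")

    # 5) Final decision: answer the question from the agreed report.
    return backbone(f"DECIDE: {question} || report: {report}")


def stub_backbone(prompt: str) -> str:
    """Deterministic stand-in for an LLM, used only to exercise the control flow."""
    if prompt.startswith("RECRUIT"):
        return "cardiologist, radiologist"
    if prompt.startswith("ANALYZE"):
        return "analysis"
    if prompt.startswith("SUMMARIZE"):
        return "draft report"
    if prompt.startswith("REVIEW"):
        return "agree"
    if prompt.startswith("REVISE"):
        return "revised report"
    return "final answer"


if __name__ == "__main__":
    print(run_medagents_style("chest pain with dyspnea", stub_backbone))
```

Because the pipeline depends only on the `Backbone` callable, comparing two backbone models under this sketch amounts to passing a different function while the agent logic stays fixed.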