A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?

Yunfei Xie, Juncheng Wu, Haoqin Tu, Siwei Yang, Bingchen Zhao, Yongshuo Zong, Qiao Jin, Cihang Xie, Yuyin Zhou

TL;DR

OpenAI o1, a chain-of-thought–enhanced LLM trained with reinforcement learning, is systematically evaluated on 37 medical datasets (including 2 novel QA datasets from NEJM and The Lancet) across three core capabilities: understanding, reasoning, and multilinguality. The study finds that o1’s enhanced reasoning transfers effectively to clinical understanding and diagnostic reasoning, achieving an average accuracy of 74.3% across 19 datasets and outperforming GPT-4 on many medical benchmarks. However, hallucination, multilingual reasoning challenges, and metric biases persist, indicating that no single model dominates all medical tasks. The work highlights the need for robust evaluation protocols, reliable prompting strategies, and future research to realize a safe, effective AI clinician capable of handling complex multilingual clinical scenarios.

Abstract

Large language models (LLMs) have exhibited remarkable capabilities across various domains and tasks, pushing the boundaries of our knowledge in learning and cognition. The latest model, OpenAI's o1, stands out as the first LLM with an internalized chain-of-thought technique using reinforcement learning strategies. While it has demonstrated surprisingly strong capabilities on various general language tasks, its performance in specialized fields such as medicine remains unknown. To this end, this report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality. Specifically, our evaluation encompasses 6 tasks using data from 37 medical datasets, including two newly constructed and more challenging question-answering (QA) tasks based on professional medical quizzes from the New England Journal of Medicine (NEJM) and The Lancet. These datasets offer greater clinical relevance than standard medical QA benchmarks such as MedQA, translating more effectively into real-world clinical utility. Our analysis of o1 suggests that the enhanced reasoning ability of LLMs may significantly benefit their capability to understand various medical instructions and reason through complex clinical scenarios. Notably, o1 surpasses the previous GPT-4 in accuracy by an average of 6.2% and 6.6% across 19 datasets and two newly created complex QA scenarios, respectively. At the same time, we identify several weaknesses in both the model capability and the existing evaluation protocols, including hallucination, inconsistent multilingual ability, and discrepant metrics for evaluation. We release our raw data and model outputs at https://ucsc-vlaa.github.io/o1_medicine/ for future research.
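The abstract reports accuracy gaps averaged over several datasets. A minimal sketch of one way such a figure can be computed is shown below, assuming exact-match scoring and an unweighted (macro) mean over datasets; the source does not state which averaging scheme was used. The dataset names, model names, and answers are hypothetical toy values, not results from the paper.

```python
from typing import Dict, Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def macro_average_accuracy(per_dataset: Dict[str, float]) -> float:
    """Unweighted mean of per-dataset accuracies (each dataset counts equally)."""
    if not per_dataset:
        raise ValueError("need at least one dataset")
    return sum(per_dataset.values()) / len(per_dataset)


# Toy, made-up multiple-choice answers for two hypothetical datasets.
# These are illustrative only and are NOT data or results from the paper.
references = {"toy_qa_a": ["A", "C", "B", "D"], "toy_qa_b": ["B", "B"]}
model_x = {"toy_qa_a": ["A", "C", "B", "A"], "toy_qa_b": ["B", "A"]}
model_y = {"toy_qa_a": ["A", "B", "B", "A"], "toy_qa_b": ["B", "A"]}

acc_x = {name: accuracy(model_x[name], refs) for name, refs in references.items()}
acc_y = {name: accuracy(model_y[name], refs) for name, refs in references.items()}

avg_x = macro_average_accuracy(acc_x)
avg_y = macro_average_accuracy(acc_y)

# Gap between the two models, expressed in percentage points.
gap_points = (avg_x - avg_y) * 100
print(f"model_x: {avg_x:.3f}  model_y: {avg_y:.3f}  gap: {gap_points:.1f} points")
```

A macro average gives a two-question dataset the same weight as a four-question one; pooling all questions first (a micro average) would generally give a different number, so the averaging scheme matters when comparing reported gaps.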

Paper Structure (40 sections, 7 figures, 10 tables)

Figures (7)

  • Figure 1: Overall results of o1 and four other strong LLMs. We show performance on 12 medical datasets spanning diverse domains. o1 demonstrates a clear performance advantage over closed- and open-source models.
  • Figure 2: Average accuracy of o1 and four other strong LLMs. o1 achieves the highest average accuracy of 74.3% across 19 medical datasets.
  • Figure 3: Our evaluation pipeline has different (a) aspects with various (b) prompting strategies using the latest (c) language models. We leverage a comprehensive set of (d) evaluations to present a holistic view of model progress in the medical domain.
  • Figure 4: Answers from o1 and GPT-4 on a question from LancetQA. o1 provides a more concise and accurate reasoning process compared to GPT-4.
  • Figure 5: Failure case of o1 on AI Hospital. The model struggles to generate the right diagnosis and produces mixed-language output, resulting in suboptimal performance in this context.
  • ...and 2 more figures