Capabilities of GPT-5 on Multimodal Medical Reasoning

Shansong Wang, Mingzhe Hu, Qiang Li, Mojtaba Safari, Xiaofeng Yang

TL;DR

This study evaluates GPT-5 as a generalist multimodal medical reasoner under a unified QA/VQA protocol, benchmarking it against GPT-4o and the smaller GPT-5 variants (GPT-5-mini and GPT-5-nano) across text and imaging tasks. It reports state-of-the-art accuracy on MedQA, MedXpertQA, MMLU-medical, USMLE, and VQA-RAD, with pronounced gains in multimodal reasoning and scores above those of pre-licensed human experts on MedXpertQA MM. The results suggest GPT-5 can serve as a core component for multimodal clinical decision support, though the authors caution that benchmark conditions are idealized. The work lays groundwork for integrating text, structured data, and images in real-time clinical reasoning and points to future trials and calibration for real-world deployment.

Abstract

Recent advances in large language models (LLMs) have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.26% and +26.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5's ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems.
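The comparison described above rests on scoring multiple-choice answers and comparing accuracies between models. The following is a minimal sketch of that arithmetic, not the authors' evaluation code: the answer-parsing rule, the function names `extract_choice` and `accuracy`, and the toy responses are all hypothetical, and it assumes the reported gains are read as differences between accuracy percentages.

```python
import re
from typing import List, Optional

# Hypothetical parsing rule: take the last option letter stated after the
# word "answer" in a free-text chain-of-thought response.
_ANSWER_RE = re.compile(r"(?i:answer)\s*(?:is|:)?\s*\(?([A-J])\)?\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the final option letter found in the response, or None."""
    matches = _ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: List[str], gold: List[str]) -> float:
    """Percentage of responses whose extracted letter equals the gold letter."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return 100.0 * correct / len(gold)


# Toy data for illustration only; these are not results from the paper.
gold = ["A", "C", "B", "D"]
model_a = [
    "Let's think step by step. The answer is (A).",
    "Reasoning about the image... Answer: C",
    "The findings suggest B. The answer is B",
    "The answer is (A).",
]
model_b = [
    "The answer is (A).",
    "Answer: D",
    "I cannot determine a final option.",
    "Answer: D",
]

acc_a = accuracy(model_a, gold)
acc_b = accuracy(model_b, gold)
gain = acc_a - acc_b  # difference in accuracy, in percentage points
print(f"model_a={acc_a:.2f}%  model_b={acc_b:.2f}%  gain={gain:+.2f}")
```

A response with no parseable option counts as incorrect here, which is one possible convention; an actual benchmark harness may handle unparseable outputs differently.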

Paper Structure

This paper contains 11 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Prompting design for QA/VQA task.
  • Figure 2: A prompting design sample from MedXpertQA.
  • Figure 3: GPT-5 reasoning output and final answer for MedXpertQA: case MM-1993.