Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary

Alexandru Florea; Shansong Wang; Mingzhe Hu; Qiang Li; Zach Eidex; Luke del Balzo; Mojtaba Safari; Xiaofeng Yang

Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary

Alexandru Florea, Shansong Wang, Mingzhe Hu, Qiang Li, Zach Eidex, Luke del Balzo, Mojtaba Safari, Xiaofeng Yang

TL;DR

This landscape commentary provides the first controlled, cross-sectional evaluation of the GPT-5 family against its predecessor GPT-4o, indicating that while GPT-5 represents a meaningful advance toward integrated multimodal clinical reasoning, generalist models are not yet substitutes for purpose-built systems in highly specialized, perception-critical tasks.

Abstract

The transition from task-specific artificial intelligence toward general-purpose foundation models raises fundamental questions about their capacity to support the integrated reasoning required in clinical medicine, where diagnosis demands synthesis of ambiguous patient narratives, laboratory data, and multimodal imaging. This landscape commentary provides the first controlled, cross-sectional evaluation of the GPT-5 family (GPT-5, GPT-5 Mini, GPT-5 Nano) against its predecessor GPT-4o across a diverse spectrum of clinically grounded tasks, including medical education examinations, text-based reasoning benchmarks, and visual question-answering in neuroradiology, digital pathology, and mammography using a standardized zero-shot chain-of-thought protocol. GPT-5 demonstrated substantial gains in expert-level textual reasoning, with absolute improvements exceeding 25 percentage-points on MedXpertQA. When tasked with multimodal synthesis, GPT-5 effectively leveraged this enhanced reasoning capacity to ground uncertain clinical narratives in concrete imaging evidence, achieving state-of-the-art or competitive performance across most VQA benchmarks and outperforming GPT-4o by margins of 10-40% in mammography tasks requiring fine-grained lesion characterization. However, performance remained moderate in neuroradiology (44% macro-average accuracy) and lagged behind domain-specific models in mammography, where specialized systems exceed 80% accuracy compared to GPT-5's 52-64%. These findings indicate that while GPT-5 represents a meaningful advance toward integrated multimodal clinical reasoning, mirroring the clinician's cognitive process of biasing uncertain information with objective findings, generalist models are not yet substitutes for purpose-built systems in highly specialized, perception-critical tasks.

Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary

TL;DR

Abstract

Paper Structure (17 sections, 7 figures, 1 table)

This paper contains 17 sections, 7 figures, 1 table.

Introduction
Methods
Datasets
Prompting Design
Results
Performance of GPT-5 on Medical Education Exams
Performance of GPT-5 on QA Benchmarks
Performance of GPT-5 on VQA Benchmarks
Expert-Level Medical Reasoning and Understanding
Brain Tumor MRI
Digital Pathology
Mammography
Discussion
Conclusion
Acknowledgments
...and 2 more sections

Figures (7)

Figure 1: Workflow diagram of the proposed evaluation pipeline. (Right panel) Datasets are first standardized and relevant data is extracted into the chain-of-thought (CoT) prompting design. For Question Answering (QA) tasks, only textual data is passed to the CoT dialog; for Visual Question Answering (VQA) tasks, both text and image data is provided. The correct answer used for evaluating model performance is saved from the dataset. (Left panel) Each of four LLM models---GPT-5, GPT-5 Mini, GPT-5 Nano, and GPT-4o---are evaluated under a zero-shot protocol. First, the model is anchored to the role of a medical assistant to elucidate clinical thinking and reasoning. Then, a chain-of-thought reasoning scheme is triggered via prompting, and relevant question-answering data is passed from the dataset to the model for answering. This step enables a free-flowing thought process generation without the model deciding on a final answer choice. The model's prediction rational is saved and can be examined in detail. Finally, answer choice convergence is forced via direct prompting. This final choice is compared to the correct answer to assess performance accuracy (bottom right). Each model is exposed to the same inputs. (Middle panel) Chain-of-thought dialog example. User input is shown in blue, while LLM output is shown in grey.
Figure 2: Zero-shot chain-of-thought prompting template for text-only question-answering (QA) tasks, consisting of an initial rationale-generation turn followed by a constrained, single-letter answer selection.
Figure 3: Zero-shot chain-of-thought prompting template for multimodal visual question-answering (VQA) tasks. An initial rationale-generation turn is accompanied with a piece of imaging evidence and followed by a constrained, single-letter answer selection.
Figure A1: Detailed prompting design template for a representative question taken from the MedXpertQA dataset (Case: MM-1993). The associated image is provided first, followed by the question and labeled answer choices. The model outputs an intermediate reasoning step (ASSISTANT_RATIONALE) and then a final letter (ASSISTANT_FINAL) corresponding to its chosen answer. Accuracy was computed solely from ASSISTANT_FINAL.
Figure A2: Output of GPT-5 intermediate reasoning step stored in the ASSISTANT_RATIONALE variable and final answer for MedXpertQA case MM-1993.
...and 2 more figures

Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary

TL;DR

Abstract

Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary

Authors

TL;DR

Abstract

Table of Contents

Figures (7)