
Radiology's Last Exam (RadLE): Benchmarking Frontier Multimodal AI Against Human Experts and a Taxonomy of Visual Reasoning Errors in Radiology

Suvrankar Datta, Divya Buchireddygari, Lakshmi Vennela Chowdary Kaza, Mrudula Bhalke, Kautik Singh, Ayush Pandey, Sonit Sai Vasipalli, Upasana Karnwal, Hakikat Bir Singh Bhatti, Bhavya Ratan Maroo, Sanjana Hebbar, Rahul Joseph, Gurkawal Kaur, Devyani Singh, Akhil V, Dheeksha Devasya Shama Prasad, Nishtha Mahajan, Ayinaparthi Arisha, Rajesh Vanagundi, Reet Nandy, Kartik Vuthoo, Snigdhaa Rajvanshi, Nikhileswar Kondaveeti, Suyash Gunjal, Rishabh Jain, Rajat Jain, Anurag Agrawal

TL;DR

RadLE v1 benchmarks frontier generalist multimodal AI against radiologists and trainees on 50 expert-level radiology spot-diagnosis cases across multiple imaging modalities. The study combines web-interface evaluations with API-based GPT-5 reasoning tests and introduces a qualitative taxonomy of visual reasoning errors to analyze failure modes. Board-certified radiologists outperform every AI model (83% accuracy versus 30% for the best model, GPT-5), and extended AI reasoning offers limited benefit: it increases latency while producing inconsistent accuracy gains. The authors provide a practical error taxonomy for understanding AI failures, discuss methodological and safety implications, and call for regulated, domain-specific evaluation to guide robust model development.

Abstract

Generalist multimodal AI systems such as large language models (LLMs) and vision-language models (VLMs) are increasingly accessed by clinicians and patients alike for medical image interpretation through widely available consumer-facing chatbots. Most evaluations claiming expert-level performance are on public datasets containing common pathologies. Rigorous evaluation of frontier models on difficult diagnostic cases remains limited. We developed a pilot benchmark of 50 expert-level "spot diagnosis" cases across multiple imaging modalities to evaluate the performance of frontier AI models against board-certified radiologists and radiology trainees. To mirror real-world usage, the reasoning modes of five popular frontier AI models were tested through their native web interfaces, viz. OpenAI o3, OpenAI GPT-5, Gemini 2.5 Pro, Grok-4, and Claude Opus 4.1. Accuracy was scored by blinded experts, and reproducibility was assessed across three independent runs. GPT-5 was additionally evaluated across various reasoning modes. Reasoning quality errors were assessed and a taxonomy of visual reasoning errors was defined. Board-certified radiologists achieved the highest diagnostic accuracy (83%), outperforming trainees (45%) and all AI models (best performance shown by GPT-5: 30%). Reliability was substantial for GPT-5 and o3, moderate for Gemini 2.5 Pro and Grok-4, and poor for Claude Opus 4.1. These findings demonstrate that advanced frontier models fall far short of radiologists in challenging diagnostic cases. Our benchmark highlights the present limitations of generalist AI in medical imaging and cautions against unsupervised clinical use. We also provide a qualitative analysis of reasoning traces and propose a practical taxonomy of visual reasoning errors by AI models to better understand their failure modes, inform evaluation standards, and guide more robust model development.
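
The abstract reports that reproducibility was assessed across three independent runs, but it does not name the agreement statistic. Purely as an illustration, and as an assumption rather than the authors' method, the sketch below computes Fleiss' kappa over per-case correct/incorrect scores from three runs. The function name `fleiss_kappa` and the toy scores are hypothetical and are not taken from the study.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of cases, each a list with one label per run.

    Assumption: every case has the same number of runs (at least two).
    """
    n_cases = len(ratings)
    n_runs = len(ratings[0])
    totals = Counter()   # how often each label was assigned overall
    p_sum = 0.0          # sum of per-case agreement values
    for case in ratings:
        counts = Counter(case)
        totals.update(counts)
        # Fraction of run pairs that agree on this case.
        agreeing = sum(c * c for c in counts.values()) - n_runs
        p_sum += agreeing / (n_runs * (n_runs - 1))
    p_bar = p_sum / n_cases
    # Agreement expected by chance from the overall label frequencies.
    p_e = sum((t / (n_cases * n_runs)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return 1.0  # only one label ever used; agreement is trivial
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical scores: 1 = correct, 0 = incorrect, three runs per case.
toy_scores = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 0]]
print(round(fleiss_kappa(toy_scores), 3))
```

Descriptive labels such as "moderate" or "substantial" are conventionally attached to ranges of kappa-type statistics; whether the study used this statistic or those thresholds is not stated in the abstract.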

Paper Structure

This paper contains 53 sections, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: Diagnostic accuracy across humans and multimodal AI systems on the Radiology’s Last Exam (RadLE) v1 benchmark. Board-certified radiologists achieved the highest accuracy (0.83), followed by trainees (0.45). All tested frontier models underperformed, with GPT-5 (0.30) and Gemini 2.5 Pro (0.29) showing the best AI results but falling well below human benchmarks.
  • Figure 2: Case distribution by clinical system, illustrating the allocation of spot-diagnosis cases across various anatomical systems within the benchmark dataset.
  • Figure 3: Overall diagnostic accuracy across reader groups. This figure presents the mean diagnostic accuracy with 95% confidence intervals for board-certified radiologists, radiology trainees, and five frontier AI models (GPT-5, Gemini 2.5 Pro, OpenAI o3, Grok-4, and Claude Opus 4.1) on 50 challenging radiology spot-diagnosis cases.
  • Figure 4: Detailed modality-specific diagnostic accuracy. This figure expands on the modality-level comparison, showing mean diagnostic accuracies with 95% confidence intervals for radiologists, trainees, and five frontier AI models (GPT-5, Gemini 2.5 Pro, OpenAI o3, Grok-4, and Claude Opus 4.1) across CT, MRI, and Radiography.
  • Figure 5: Modality-specific diagnostic accuracy for radiologists, trainees, and large language models. This radar plot illustrates the diagnostic accuracy of board-certified radiologists, radiology trainees, and five frontier AI models (GPT-5, Gemini 2.5 Pro, OpenAI o3, Grok-4, and Claude Opus 4.1) across three imaging modalities (CT, MRI, and Radiography).
  • ...and 2 more figures
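
Several captions above report accuracy with 95% confidence intervals. The interval method is not given in this listing, so the following is only a minimal sketch of one common choice, the Wilson score interval for a proportion. The function name `wilson_interval` and the example counts are assumptions made for illustration; 15 correct out of 50 is used simply because it corresponds to an accuracy of 0.30.

```python
import math


def wilson_interval(n_correct, n_total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p_hat = n_correct / n_total
    z2 = z * z
    denom = 1.0 + z2 / n_total
    center = (p_hat + z2 / (2.0 * n_total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / n_total + z2 / (4.0 * n_total ** 2))
    ) / denom
    return center - half_width, center + half_width


# Hypothetical example: 15 correct answers on a 50-case benchmark.
low, high = wilson_interval(15, 50)
print(f"accuracy = {15 / 50:.2f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With only 50 cases the interval is wide, which is why small differences between models on a benchmark of this size should be read cautiously.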