Performance of Large Language Models in Supporting Medical Diagnosis and Treatment

Diogo Sousa; Guilherme Barbosa; Catarina Rocha; Dulce Oliveira

TL;DR

This study benchmarks 21 contemporary LLMs on the 2024 Portuguese PNA to assess medical knowledge and reasoning in Portuguese, revealing substantial variation in accuracy and cost-effectiveness. Using a non-fine-tuned, pass@1 evaluation on 150 MCQs, the authors introduce a composite Score that weights accuracy, cost, and contamination risk via $Score = 100 \times (Correct/N)^3 \times \frac{1}{\sqrt{1 + \log_{10}(P + 1)}} \times C_{\text{risk}}$, showing several models close to or surpassing top student performance while favoring affordable options. The results indicate that some low- to zero-cost models achieve strong performance, and explicit reasoning methods (CoT/CoD) can enhance accuracy, though latency and resource costs must be managed. The discussion emphasizes AI-human collaboration, reliability, privacy, and regulatory considerations (e.g., GDPR and EU AI Act) as essential for safe clinical deployment, and calls for real-world clinical vignette testing and blinded benchmarks to advance trustworthy adoption.
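The composite Score above can be sketched in a few lines of Python. This is an illustrative reading of the formula, not the authors' implementation: the summary does not define $P$ or $C_{\text{risk}}$, so the sketch assumes $P$ is a non-negative price figure (units unspecified) and $C_{\text{risk}}$ is a multiplicative contamination-risk factor. The function name `composite_score` and its parameter names are hypothetical.

```python
import math


def composite_score(correct: int, n: int, price: float, c_risk: float) -> float:
    """Evaluate Score = 100 * (Correct/N)^3 * 1/sqrt(1 + log10(P + 1)) * C_risk.

    Assumptions (not stated in the summary): `price` is a non-negative
    cost figure P, and `c_risk` is a multiplicative contamination factor.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if price < 0:
        raise ValueError("price must be non-negative")

    accuracy = correct / n
    # Cubing the accuracy penalizes wrong answers more than linearly.
    accuracy_term = accuracy ** 3
    # The log-damped cost term equals 1 for a free model and shrinks slowly as P grows.
    cost_term = 1.0 / math.sqrt(1.0 + math.log10(price + 1.0))
    return 100.0 * accuracy_term * cost_term * c_risk


# Hypothetical inputs for illustration only, not results from the paper.
perfect_free = composite_score(correct=150, n=150, price=0.0, c_risk=1.0)
half_free = composite_score(correct=75, n=150, price=0.0, c_risk=1.0)
perfect_paid = composite_score(correct=150, n=150, price=9.0, c_risk=1.0)
```

Under these assumptions, a free model with a perfect result and $C_{\text{risk}} = 1$ scores 100, halving the accuracy cuts the score to one eighth of that because of the cubic term, and a price of $P = 9$ divides the score by $\sqrt{2}$.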

Abstract

The integration of Large Language Models (LLMs) into healthcare holds significant potential to enhance diagnostic accuracy and support medical treatment planning. These AI-driven systems can analyze vast datasets, assisting clinicians in identifying diseases, recommending treatments, and predicting patient outcomes. This study evaluates the performance of a range of contemporary LLMs, including both open-source and closed-source models, on the 2024 Portuguese National Exam for medical specialty access (PNA), a standardized medical knowledge assessment. Our results highlight considerable variation in accuracy and cost-effectiveness, with several models demonstrating performance exceeding human benchmarks for medical students on this specific task. We identify leading models based on a combined score of accuracy and cost, discuss the implications of reasoning methodologies like Chain-of-Thought, and underscore the potential for LLMs to function as valuable complementary tools aiding medical professionals in complex clinical decision-making.
