Do Large Language Models have Shared Weaknesses in Medical Question Answering?

Andrew M. Bean; Karolina Korgul; Felix Krones; Robert McCraith; Adam Mahdi

TL;DR

This study examines whether large language models share weaknesses in medical question answering by evaluating 16 well-known LLMs on 874 questions from Polish medical licensing exams (LEK) and comparing the results to human performance. It uses Top-1 Accuracy and Expected Accuracy, along with logistic regressions relating correctness to question length and model confidence. Larger models generally perform better, but training data and architecture also strongly influence outcomes. Accuracies are positively correlated across models ($0.39$–$0.58$) and weakly correlated with human performance ($0.09$–$0.13$); higher confidence predicts higher accuracy, and longer questions predict lower accuracy. Medical jurisprudence remains notably challenging due to local legal contexts. The findings suggest that reliability patterns in LLMs persist across generations, which informs safe deployment, model selection, and the importance of human oversight in medical QA systems.

Abstract

Large language models (LLMs) have made rapid improvement on medical benchmarks, but their unreliability remains a persistent challenge for safe real-world uses. Designing for the use of LLMs as a category, rather than for specific models, requires an understanding of the shared strengths and weaknesses that appear across models. To address this challenge, we benchmark a range of top LLMs and identify consistent patterns across models. We test $16$ well-known LLMs on $874$ newly collected questions from Polish medical licensing exams. For each question, we score each model on the top-1 accuracy and the distribution of probabilities assigned. We then compare these results with factors such as question difficulty for humans, question length, and the scores of the other models. LLM accuracies were positively correlated pairwise ($0.39$ to $0.58$). Model performance was also correlated with human performance ($0.09$ to $0.13$), but negatively correlated with the difference between the question-level accuracy of top-scoring and bottom-scoring humans ($-0.09$ to $-0.14$). The top output probability and question length were positive and negative predictors of accuracy, respectively ($p < 0.05$). The top-scoring LLM, GPT-4o Turbo, scored $84\%$, with Claude Opus, Gemini 1.5 Pro and Llama 3/3.1 between $74\%$ and $79\%$. We found evidence of similarities between models in which questions they answer correctly, as well as similarities with human test takers. Larger models typically performed better, but differences in training, architecture, and data were also highly impactful. Model accuracy was positively correlated with confidence, but negatively correlated with question length. We find similar results with older models, and argue that these patterns are likely to persist across future models using similar training methods.
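
The two per-question scores named in the abstract can be illustrated with a minimal sketch. It assumes that each model yields a probability distribution over the answer options, that top-1 accuracy counts a question as correct when the highest-probability option is the right one, and that expected accuracy is the mean probability assigned to the right option. The function names, the five-option layout, and the toy numbers are hypothetical and are not taken from the paper.

```python
def top1_accuracy(probs, correct):
    """Fraction of questions where the most probable option is the correct one."""
    hits = [
        max(range(len(p)), key=p.__getitem__) == c  # index of the largest probability
        for p, c in zip(probs, correct)
    ]
    return sum(hits) / len(hits)


def expected_accuracy(probs, correct):
    """Mean probability the model assigns to the correct option (assumed definition)."""
    return sum(p[c] for p, c in zip(probs, correct)) / len(correct)


# Toy data (hypothetical): 3 questions, 5 answer options each.
probs = [
    [0.7, 0.1, 0.1, 0.05, 0.05],
    [0.2, 0.5, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.5, 0.2],
]
correct = [0, 2, 3]  # index of the correct option for each question

print(top1_accuracy(probs, correct))
print(expected_accuracy(probs, correct))
```

Under these assumptions, expected accuracy rewards a model for placing probability mass on the right answer even when that answer is not its top choice, which is why the two scores can diverge on the same question set.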

Paper Structure (22 sections, 2 equations, 10 tables)

Introduction
Methods
Dataset
Format
Human comparison
Models and implementation details
Prompting
Evaluation metrics
Top-1 Accuracy
Expected Accuracy
Logistic regression models
Results
Accuracy and expected accuracy scores
Correlations with other LLMs
Correlations with human test takers
...and 7 more sections
