Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning
Jonathan Kim, Anna Podlasek, Kie Shidara, Feng Liu, Ahmed Alaa, Danilo Bernardo
TL;DR
This work introduces the Medical Abstraction and Reasoning Corpus (M-ARC), an adversarial benchmark designed to elicit inflexible, Einstellung-like reasoning in large language models (LLMs) for clinical problem-solving. By comparing LLMs to physicians across 100 long-tail, USMLE-like questions that include open-ended and data-seeking elements, the study reveals widespread underperformance and hallucinations among state-of-the-art models, with only modest gains for the largest systems. Uncertainty analyses further show that LLMs are frequently overconfident despite limited accuracy, highlighting safety concerns for clinical use. The findings argue for cautious deployment, selective prediction strategies, and the need for rigorous benchmarks that stress generalization and reasoning flexibility in medical AI.
Abstract
Large Language Models (LLMs) have attained human-level accuracy on medical question-answer (QA) benchmarks. However, their limitations in navigating open-ended clinical scenarios have recently been shown, raising concerns about the robustness and generalizability of LLM reasoning across diverse, real-world medical tasks. To probe potential LLM failure modes in clinical problem-solving, we present the Medical Abstraction and Reasoning Corpus (M-ARC). M-ARC assesses clinical reasoning through scenarios designed to exploit the Einstellung effect (the fixation of thought arising from prior experience), targeting the inductive bias of LLMs toward inflexible pattern matching from their training data in place of flexible reasoning. We find that LLMs, including current state-of-the-art o1 and Gemini models, perform poorly compared to physicians on M-ARC, often demonstrating a lack of commonsense medical reasoning and a propensity to hallucinate. In addition, uncertainty estimation analyses indicate that LLMs exhibit overconfidence in their answers, despite their limited accuracy. The failure modes revealed by M-ARC in LLM medical reasoning underscore the need to exercise caution when deploying these models in clinical settings.
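
To make the notion of overconfidence concrete, the following is a minimal sketch of two common calibration summaries: the gap between mean stated confidence and accuracy, and the expected calibration error (ECE) over equal-width confidence bins. It is not taken from the paper; the function names, the bin count, and the toy confidence values are illustrative assumptions, and the uncertainty estimation method used in the study may differ.

```python
from typing import List


def overconfidence_gap(confidences: List[float], correct: List[bool]) -> float:
    """Mean stated confidence minus accuracy; positive values mean overconfidence."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(1 for ok in correct if ok) / len(correct)
    return mean_conf - accuracy


def expected_calibration_error(
    confidences: List[float], correct: List[bool], n_bins: int = 5
) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    total = len(confidences)
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, conf in enumerate(confidences):
        # A confidence of exactly 1.0 is placed in the last bin.
        bins[min(int(conf * n_bins), n_bins - 1)].append(i)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        bin_conf = sum(confidences[i] for i in members) / len(members)
        bin_acc = sum(1 for i in members if correct[i]) / len(members)
        ece += (len(members) / total) * abs(bin_acc - bin_conf)
    return ece


# Hypothetical toy data (not results from the paper): a model that reports
# 90% confidence on four questions but answers only one of them correctly.
toy_confidences = [0.9, 0.9, 0.9, 0.9]
toy_correct = [True, False, False, False]
toy_gap = overconfidence_gap(toy_confidences, toy_correct)
toy_ece = expected_calibration_error(toy_confidences, toy_correct)
```

A well-calibrated model would have a gap and an ECE near zero. Large positive values, as in the toy data, correspond to the pattern described above, in which stated confidence exceeds actual accuracy.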
