Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning

Jonathan Kim, Anna Podlasek, Kie Shidara, Feng Liu, Ahmed Alaa, Danilo Bernardo

TL;DR

This work introduces the Medical Abstraction and Reasoning Corpus (M-ARC), an adversarial benchmark designed to elicit inflexible, Einstellung-like reasoning in large language models (LLMs) for clinical problem-solving. By comparing LLMs to physicians across 100 long-tail, USMLE-like questions that include open-ended and data-seeking elements, the study reveals widespread underperformance and hallucinations among state-of-the-art models, with only modest gains for the largest systems. Uncertainty analyses further show that LLMs are frequently overconfident despite limited accuracy, highlighting safety concerns for clinical use. The findings argue for cautious deployment, selective prediction strategies, and the need for rigorous benchmarks that stress generalization and reasoning flexibility in medical AI.

Abstract

Large Language Models (LLMs) have attained human-level accuracy on medical question-answer (QA) benchmarks. However, their limitations in navigating open-ended clinical scenarios have recently been shown, raising concerns about the robustness and generalizability of LLM reasoning across diverse, real-world medical tasks. To probe potential LLM failure modes in clinical problem-solving, we present the medical abstraction and reasoning corpus (M-ARC). M-ARC assesses clinical reasoning through scenarios designed to exploit the Einstellung effect -- the fixation of thought arising from prior experience, targeting LLM inductive biases toward inflexible pattern matching from their training data rather than engaging in flexible reasoning. We find that LLMs, including current state-of-the-art o1 and Gemini models, perform poorly compared to physicians on M-ARC, often demonstrating lack of commonsense medical reasoning and a propensity to hallucinate. In addition, uncertainty estimation analyses indicate that LLMs exhibit overconfidence in their answers, despite their limited accuracy. The failure modes revealed by M-ARC in LLM medical reasoning underscore the need to exercise caution when deploying these models in clinical settings.
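The overconfidence finding can be made concrete with a small calibration check. The sketch below is illustrative only: the abstract does not say which uncertainty metric was used, so expected calibration error (ECE) and a simple confidence threshold for selective prediction are assumed here as common stand-ins. The function names and the sample confidences are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted mean gap between stated confidence and accuracy, per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


def selective_accuracy(
    confidences: Sequence[float], correct: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Return (coverage, accuracy on answered items) when the model
    abstains on every item whose confidence is below the threshold."""
    kept = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(kept) / len(confidences)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


if __name__ == "__main__":
    # Hypothetical values: high stated confidence, only half correct.
    confs = [0.9, 0.9, 0.9, 0.9]
    right = [True, False, False, True]
    print("ECE:", expected_calibration_error(confs, right))
    print("coverage, accuracy:", selective_accuracy(confs, right, 0.5))
```

A large gap between mean confidence and accuracy is what "overconfident" means in this sense. Selective prediction trades coverage for accuracy by abstaining on low-confidence items, which only helps if confidence is informative in the first place.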

Paper Structure

This paper contains 10 sections, 6 figures.

Figures (6)

  • Figure 1: Demonstration of an M-ARC question that uses a long-tail reasoning pattern. The presented information follows a text pattern commonly seen in medical QA (an anticoagulant leading to a brain bleed). The adversarial answer choice targets reliance on rote pattern matching. However, it is easily avoided with deductive reasoning through logical negation: the complete absence of a brain renders a brain bleed impossible. This clinical situation represents a long-tail reasoning pattern, which further obscures the correct answer.
  • Figure 2: Comparison of LLM and human performance on M-ARC. The bar heights represent the accuracy of each model, with colors indicative of the respective model family. The final bar represents human performance (0.66), averaged across five physicians, with a standard error bar (±0.053). Gemini-1.5-Pro and o1 achieved the highest performance with accuracies of 0.5 and 0.48, respectively.
  • Figure 3: In this example question, o1's incorrect response reveals both a failure in fundamental medical commonsense reasoning and a hallucination: the assertion that blood pressure can be measured on the forehead is false.
  • Figure 4: In this example question, GPT4o's incorrect response arises from a deductive reasoning error in integrating two key details about the patient's condition: (1) the patient lacks a brain, and (2) in the absence of a brain, normal EEG activity cannot be expected. Therefore, GPT4o's reasoning that an abnormal EEG indicates a possible intracranial hemorrhage is logically flawed. The problem does not state how long the lethargy has been present, and it could be chronic, so obtaining additional history is warranted before considering treatment.
  • Figure 5: In this example question, GPT4o's incorrect response and subsequent reasoning reveal a deficiency in medical commonsense reasoning. A basic principle, both widely taught and intuitively obvious, is that the first step in assessing a patient who appears to be unconscious is to attempt to wake them.
  • ...and 1 more figure
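
The Figure 2 caption reports human performance as a mean across five physicians with a standard error bar. A minimal sketch of how such a summary is typically computed follows, assuming the standard error is the sample standard deviation divided by the square root of the number of raters. The caption does not state the formula, and the per-physician scores below are hypothetical placeholders, not the study's data.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_and_standard_error(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) for per-rater accuracies.

    Assumes SE = sample standard deviation / sqrt(n).
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    return mean(scores), stdev(scores) / math.sqrt(len(scores))


if __name__ == "__main__":
    # Hypothetical accuracies for five raters (illustration only).
    hypothetical_scores = [0.5, 0.6, 0.7, 0.8, 0.7]
    avg, se = mean_and_standard_error(hypothetical_scores)
    print(f"mean={avg:.3f}, standard error={se:.3f}")
```

With only five raters the standard error is a coarse estimate, which is worth keeping in mind when comparing the human bar against individual model accuracies.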