Why Chain of Thought Fails in Clinical Text Understanding

Jiageng Wu, Kevin Xie, Bowen Gu, Nils Krüger, Kueiyu Joshua Lin, Jie Yang

TL;DR

This study systematically evaluates chain-of-thought (CoT) prompting across 95 LLMs on 87 real-world clinical tasks in 9 languages and finds that CoT degrades performance for most models in clinical text understanding. Using analyses of reasoning length, medical concept grounding, and lexical signatures, together with an LLM-as-a-Judge evaluation validated by clinicians, the authors identify two key failure mechanisms: long reasoning chains and weak grounding in clinical concepts. They also show that stronger alignment with medical concepts is associated with better performance. The work provides an empirical foundation for designing safer clinical reasoning approaches that balance interpretability with reliability, and it highlights substantial variability across models, languages, and task types. Overall, the findings challenge the assumption that CoT is universally beneficial in safety-critical medical contexts and argue for domain-specific strategies that ensure trustworthy performance.

Abstract

Large language models (LLMs) are increasingly being applied to clinical care, a domain where both accuracy and transparent reasoning are critical for safe and trustworthy deployment. Chain-of-thought (CoT) prompting, which elicits step-by-step reasoning, has demonstrated improvements in performance and interpretability across a wide range of tasks. However, its effectiveness in clinical contexts remains largely unexplored, particularly for electronic health records (EHRs), the primary source of clinical documentation, which are often lengthy, fragmented, and noisy. In this work, we present the first large-scale systematic study of CoT for clinical text understanding. We assess 95 advanced LLMs on 87 real-world clinical text tasks, covering 9 languages and 8 task types. Contrary to prior findings in other domains, we observe that 86.3% of models suffer consistent performance degradation in the CoT setting. More capable models remain relatively robust, while weaker ones suffer substantial declines. To better characterize these effects, we perform fine-grained analyses of reasoning length, medical concept alignment, and error profiles, leveraging both LLM-as-a-judge evaluation and clinical expert evaluation. Our results uncover systematic patterns in when and why CoT fails in clinical contexts, which highlight a critical paradox: CoT enhances interpretability but may undermine reliability in clinical text tasks. This work provides an empirical basis for clinical reasoning strategies of LLMs, highlighting the need for transparent and trustworthy approaches.
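
The headline comparison in the abstract, the share of models that score lower under CoT than under zero-shot prompting, can be illustrated with a minimal sketch. The function name `degradation_share`, the model names, and all scores below are hypothetical placeholders, not values from the paper, and the paper's exact criterion for "consistent" degradation is not specified here; this sketch simply assumes one aggregate score per model and per prompting setting.

```python
from typing import Dict, Tuple


def degradation_share(scores: Dict[str, Tuple[float, float]]) -> float:
    """Return the fraction of models whose CoT score is below their zero-shot score.

    `scores` maps a model name to (zero_shot_score, cot_score).
    Assumption: each score is a single aggregate over all tasks.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    degraded = sum(1 for zero_shot, cot in scores.values() if cot < zero_shot)
    return degraded / len(scores)


# Hypothetical scores for illustration only (not results from the paper).
example_scores = {
    "model_a": (0.70, 0.62),
    "model_b": (0.55, 0.41),
    "model_c": (0.81, 0.82),  # slightly better with CoT
    "model_d": (0.64, 0.60),
}

share = degradation_share(example_scores)
print(f"{share:.1%} of models degrade under CoT")
```

With real per-model results substituted for `example_scores`, the same calculation would yield a figure comparable to the reported 86.3%, provided the aggregation matches the one used by the authors.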

Paper Structure

This paper contains 54 sections, 4 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Performance of LLMs in clinical text understanding under Zero-shot and CoT.
  • Figure 2: Evaluating CoT in clinical text understanding and probing why it fails.
  • Figure 3: CoT prompting shows diminishing negative impact as model capability increases.
  • Figure 4: Performance under CoT prompting decreases as reasoning length increases.
  • Figure 5: Impact of medical concept alignment on performance. Both Overlap and Coverage alignment metrics show systematic performance improvements across percentiles.
  • ...and 7 more figures