Medical Hallucinations in Foundation Models and Their Impact on Healthcare

Yubin Kim, Hyewon Jeong, Shan Chen, Shuyue Stella Li, Chanwoo Park, Mingyu Lu, Kumail Alhamoud, Jimin Mun, Cristina Grau, Minseok Jung, Rodrigo Gameiro, Lizhou Fan, Eugene Park, Tristan Lin, Joonsik Yoon, Wonjin Yoon, Maarten Sap, Yulia Tsvetkov, Paul Liang, Xuhai Xu, Xin Liu, Chunjong Park, Hyeonhoon Lee, Hae Won Park, Daniel McDuff, Samir Tulebaev, Cynthia Breazeal

TL;DR

The paper defines medical hallucination as model output that is factually incorrect, logically inconsistent, or unsupported by evidence, and that has the potential to affect clinical decisions. It argues that hallucinations in medicine stem primarily from reasoning failures rather than knowledge gaps, supporting this with a benchmark of 11 foundation models on seven medical tasks and a clinician survey. General-purpose models proved more robust to hallucinations than medical-specialized ones, and chain-of-thought prompting significantly reduced errors, which highlights the value of explicit reasoning traces. The work also presents a taxonomy, a detection and evaluation framework, and mitigation strategies (data, model, retrieval, and prompting techniques), and it discusses regulatory and ethical implications of deploying AI in healthcare. Overall, the paper positions reasoning transparency and robust uncertainty management as essential for trustworthy clinical AI, rather than relying solely on domain-specific pretraining or fine-tuning.

Abstract

Hallucinations in foundation models arise from autoregressive training objectives that prioritize token-likelihood optimization over epistemic accuracy, fostering overconfidence and poorly calibrated uncertainty. We define medical hallucination as any model-generated output that is factually incorrect, logically inconsistent, or unsupported by authoritative clinical evidence in ways that could alter clinical decisions. We evaluated 11 foundation models (7 general-purpose, 4 medical-specialized) across seven medical hallucination tasks spanning medical reasoning and biomedical information retrieval. General-purpose models achieved significantly higher proportions of hallucination-free responses than medical-specialized models (median: 76.6% vs 51.3%, difference = 25.2%, 95% CI: 18.7-31.3%, Mann-Whitney U = 27.0, p = 0.012, rank-biserial r = -0.64). Top-performing models such as Gemini-2.5 Pro exceeded 97% accuracy when augmented with chain-of-thought prompting (base: 87.6%), while medical-specialized models like MedGemma ranged from 28.6-61.9% despite explicit training on medical corpora. Chain-of-thought reasoning significantly reduced hallucinations in 86.4% of tested comparisons after FDR correction (q < 0.05), demonstrating that explicit reasoning traces enable self-verification and error detection. Physician audits confirmed that 64-72% of residual hallucinations stemmed from causal or temporal reasoning failures rather than knowledge gaps. A global survey of clinicians (n = 70) validated real-world impact: 91.8% had encountered medical hallucinations, and 84.7% considered them capable of causing patient harm. The underperformance of medical-specialized models despite domain training indicates that safety emerges from sophisticated reasoning capabilities and broad knowledge integration developed during large-scale pre-training, not from narrow optimization.
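
The group comparison reported above is a Mann-Whitney U test on two small samples (7 general-purpose and 4 medical-specialized models). As a rough illustration of how such a test works, the sketch below computes U and an exact two-sided p-value by enumerating every way of relabeling the pooled scores. The scores in `general_scores` and `medical_scores` are made-up placeholders, not the paper's data, and the function names are hypothetical; the values were simply chosen so that exactly one specialized model outscores one general-purpose model.

```python
from itertools import combinations


def mann_whitney_u(x, y):
    """U statistic for sample x: number of (xi, yj) pairs with xi > yj (ties count 0.5)."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


def exact_two_sided_p(x, y):
    """Exact permutation p-value for the Mann-Whitney U test.

    Enumerates every split of the pooled values into groups of size len(x)
    and len(y), and returns the share of splits whose U lies at least as far
    from its null mean (n1 * n2 / 2) as the observed U.
    """
    pooled = list(x) + list(y)
    n1, n2 = len(x), len(y)
    center = n1 * n2 / 2
    observed = abs(mann_whitney_u(x, y) - center)
    extreme = 0
    total = 0
    for idx in combinations(range(n1 + n2), n1):
        chosen = set(idx)
        group_x = [pooled[i] for i in idx]
        group_y = [pooled[i] for i in range(n1 + n2) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point noise in the comparison.
        if abs(mann_whitney_u(group_x, group_y) - center) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Placeholder hallucination-free rates (percent); NOT taken from the paper.
general_scores = [90.0, 85.0, 80.0, 77.0, 72.0, 68.0, 55.0]
medical_scores = [60.0, 52.0, 45.0, 30.0]

u_stat = mann_whitney_u(general_scores, medical_scores)
p_value = exact_two_sided_p(general_scores, medical_scores)
print(f"U = {u_stat}, exact two-sided p = {p_value:.4f}")
```

With group sizes of 7 and 4 there are only 330 possible relabelings, so full enumeration is cheap and avoids relying on a normal approximation that is unreliable at this sample size. The sketch does not compute an effect size or a confidence interval, and it makes no claim about how the authors ran their own analysis.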

Paper Structure

This paper contains 113 sections, 1 equation, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Overview of medical hallucinations generated by state-of-the-art LLMs. (a) Medical expert-annotated hallucination rates and potential risk assessments on three medical reasoning tasks with NEJM Medical Records. The hallucination rate is defined as the percentage of responses containing expert-identified errors (see Section \ref{sec:new_sec7} for full analysis). (b) Representative examples of medical hallucinations from chen2024detecting and vishwanath2024faithfulness, respectively. (c) Geographic distribution of clinician-reported medical hallucination incidents, providing a global perspective on the issue (see Subsection \ref{subsec:survey_results} for full analysis).
  • Figure 2: A visual taxonomy of medical hallucinations in LLMs, organized into five main clusters. (a) Factual Errors: Hallucinations arising from incorrect or conflicting factual information, encompassing Non-Factual Hallucination, Factual Hallucination, and Input-Conflicting Hallucination. (b) Outdated References: Errors due to reliance on outdated or disproven guidelines or data, exemplified by Memory-Based Hallucination. (c) Spurious Correlations: Hallucinations that merge or misinterpret data in ways that produce unfounded conclusions, including Bias-Induced Hallucination, Amalgamated Hallucination, and Multimodal Integration Hallucination. (d) Fabricated Sources or Guidelines: Inventions or misrepresentations of medical procedures and research, covering Procedural Hallucination and Research Hallucination. (e) Incomplete Chains of Reasoning: Flawed or partial logical processes, such as Reasoning Hallucination, Decision-Making Hallucination, and Diagnostic Hallucination.
  • Figure 3: Prompt examples for each step of the Chain-of-Knowledge framework. This example is from the original paper li2024chain.
  • Figure 4: Hallucination Pointwise Score vs. Similarity Score of LLMs on the Med-Halt hallucination benchmark. The results show that recent advanced models (e.g., o3-mini, deepseek-r1, and gemini-2.5-pro) typically start with high baseline hallucination resistance and see moderate but consistent gains from simple CoT prompting, while earlier models, including medical-purpose LLMs, often begin with low hallucination resistance yet can benefit from different approaches (e.g., Search, CoT, and System Prompt). Moreover, retrieval-augmented generation can be less effective if the model struggles to reconcile retrieved information with its internal knowledge.
  • Figure 5: The annotation process for medical hallucinations in LLMs (Section \ref{sec:new_sec7}). We take New England Journal of Medicine (NEJM) case records, parse them into key elements, and feed them into the LLM for response generation. Physicians then annotate the LLM-generated responses to identify medical hallucinations and potential risks, as exemplified by the inaccurate reporting of 'irregular pulse' in the patient's Emergency Department findings.
  • ...and 9 more figures