Diagnosing Hallucination Risk in AI Surgical Decision-Support: A Sequential Framework for Sequential Validation

Dong Chen, Yanzhe Wei, Zonglin He, Guan-Ming Kuang, Canhua Ye, Meiru An, Huili Peng, Yong Hu, Huiren Tao, Kenneth MC Cheung

TL;DR

This study tackles hallucination risks in LLM-driven spine-surgery decision support by introducing a clinician-centered sequential validation framework that assesses diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment across a two-round, high-stakes stress test. Six models were evaluated on 30 expert-validated spinal cases. DeepSeek-R1 achieved the top overall score of $86.03 \pm 2.08$, while extended chain-of-thought variants did not consistently improve clinical reliability (e.g., Claude-3.7-Sonnet with extended thinking scored $80.79 \pm 1.83$ vs. $81.56 \pm 1.92$ for the standard mode). The results reveal model-specific vulnerabilities, most notably a $7.4\%$ drop in recommendation quality under amplified complexity, even as rationality ($+2.0\%$) and readability ($+1.7\%$) improved slightly: outputs can read as more coherent while the clinical guidance they carry gets worse. This divergence highlights the need for interpretability tools and safety-centric governance. The authors propose sequencing-based evaluation, retrieval-grounded inference, and surgeon-in-the-loop safeguards as essential steps toward safe deployment of AI in high-risk surgical workflows, and argue that current LLMs remain unsuitable for autonomous use in spine surgery.

Abstract

Large language models (LLMs) offer transformative potential for clinical decision support in spine surgery but pose significant risks through hallucinations, which are factually inconsistent or contextually misaligned outputs that may compromise patient safety. This study introduces a clinician-centered framework to quantify hallucination risks by evaluating diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment. We assessed six leading LLMs across 30 expert-validated spinal cases. DeepSeek-R1 demonstrated superior overall performance (total score: 86.03 $\pm$ 2.08), particularly in high-stakes domains such as trauma and infection. A critical finding reveals that reasoning-enhanced model variants did not uniformly outperform standard counterparts: Claude-3.7-Sonnet's extended thinking mode underperformed relative to its standard version (80.79 $\pm$ 1.83 vs. 81.56 $\pm$ 1.92), indicating extended chain-of-thought reasoning alone is insufficient for clinical reliability. Multidimensional stress-testing exposed model-specific vulnerabilities, with recommendation quality degrading by 7.4% under amplified complexity. This decline contrasted with marginal improvements in rationality (+2.0%), readability (+1.7%) and diagnosis (+4.7%), highlighting a concerning divergence between perceived coherence and actionable guidance. Our findings advocate integrating interpretability mechanisms (e.g., reasoning chain visualization) into clinical workflows and establish a safety-aware validation framework for surgical LLM deployment.
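To put the Claude-3.7-Sonnet comparison in perspective, the gap between the two modes can be set against the reported spread. The sketch below is illustrative only. It assumes the $\pm$ values are standard deviations and that both modes were scored on the same number of cases (so an equal-weight pooled SD is reasonable); the abstract confirms neither. The helper name `standardized_difference` is hypothetical and is not part of the authors' framework.

```python
import math


def standardized_difference(mean_a, sd_a, mean_b, sd_b):
    """Return (raw gap, gap in pooled-SD units) for two mean +/- SD scores.

    Assumes equal sample sizes, so the pooled SD is the root mean of the
    two variances. This is an illustrative effect-size sketch, not a
    significance test.
    """
    raw = mean_a - mean_b
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return raw, raw / pooled_sd


# Scores quoted in the abstract (assumed here to be mean +/- SD).
standard = (81.56, 1.92)   # Claude-3.7-Sonnet, standard mode
extended = (80.79, 1.83)   # Claude-3.7-Sonnet, extended thinking

raw_gap, gap_in_sd = standardized_difference(*standard, *extended)
print(f"raw gap: {raw_gap:.2f} points")
print(f"gap in pooled-SD units: {gap_in_sd:.2f}")
```

Under these assumptions the gap is less than one point and well under one pooled standard deviation, which fits the abstract's cautious wording: extended thinking "did not uniformly outperform" the standard mode, rather than being clearly worse.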

Paper Structure

This paper contains 18 sections, 7 figures.

Figures (7)

  • Figure 1: Agreement analysis of spinal disorder scoring among three examiners. (a) correlation heatmap, (b) disease-specific concordance, and (c) score distribution.
  • Figure 2: Comparative performance of large language models in two-round evaluation of spine surgical cases.
  • Figure 3: Two-round model performances across spinal disease categories. (a) Deformity (Scoliosis). (b) Trauma. (c) Deformity (Kyphosis). (d) Infection. (e) Neoplasms. (f) Degeneration (Thoracolumbar). (g) Degeneration (Cervical).
  • Figure 4: Comparative assessment of large language models across clinical-oriented dimensions in spinal surgical decision support. (a) Overall performances. (b) First-Round initial assessment. (c) Second-Round follow-up & treatment assessment.
  • Figure 5: AI-oriented comparative performance of large language models for spinal surgical diseases. (a) easy clinical cases. (b) difficult clinical cases.
  • ...and 2 more figures