Diagnosing Hallucination Risk in AI Surgical Decision-Support: A Sequential Framework for Sequential Validation
Dong Chen, Yanzhe Wei, Zonglin He, Guan-Ming Kuang, Canhua Ye, Meiru An, Huili Peng, Yong Hu, Huiren Tao, Kenneth MC Cheung
TL;DR
This study tackles hallucination risks in LLM-driven spine-surgery decision support by introducing a clinician-centered sequential validation framework that assesses diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment across a two-round, high-stakes stress test. Six models were evaluated on 30 expert-validated spinal cases, with DeepSeek-R1 achieving the top overall score of $86.03 \pm 2.08$, while extended chain-of-thought variants did not consistently improve clinical reliability (e.g., Claude-3.7-Sonnet extended thinking: $80.79 \pm 1.83$ vs. $81.56 \pm 1.92$ for standard). The results reveal model-specific vulnerabilities, most notably a $7.4\%$ drop in recommendation quality under amplified complexity even as rationality ($+2.0\%$) and readability ($+1.7\%$) improved slightly. This divergence between apparent coherence and actionable clinical guidance highlights the need for interpretability tools and safety-centric governance. The authors propose sequential evaluation, retrieval-grounded inference, and surgeon-in-the-loop safeguards as essential steps toward safe deployment of AI in high-risk surgical workflows, and argue that current LLMs remain unsuitable for autonomous use in spine surgery.
Abstract
Large language models (LLMs) offer transformative potential for clinical decision support in spine surgery but pose significant risks through hallucinations: factually inconsistent or contextually misaligned outputs that may compromise patient safety. This study introduces a clinician-centered framework to quantify hallucination risks by evaluating diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment. We assessed six leading LLMs across 30 expert-validated spinal cases. DeepSeek-R1 demonstrated superior overall performance (total score: 86.03 $\pm$ 2.08), particularly in high-stakes domains such as trauma and infection. Critically, reasoning-enhanced model variants did not uniformly outperform their standard counterparts: Claude-3.7-Sonnet's extended thinking mode underperformed relative to its standard version (80.79 $\pm$ 1.83 vs. 81.56 $\pm$ 1.92), indicating that extended chain-of-thought reasoning alone is insufficient for clinical reliability. Multidimensional stress-testing exposed model-specific vulnerabilities, with recommendation quality degrading by 7.4% under amplified complexity. This decline contrasted with marginal improvements in rationality (+2.0%), readability (+1.7%), and diagnosis (+4.7%), highlighting a concerning divergence between perceived coherence and actionable guidance. Our findings advocate integrating interpretability mechanisms (e.g., reasoning chain visualization) into clinical workflows and establish a safety-aware validation framework for surgical LLM deployment.
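
To put the reported scores in perspective, the short sketch below compares the gap between two mean scores with the width of their reported "$\pm$" ranges. It is only an illustrative reading aid, not the authors' analysis: the abstract does not say whether "$\pm$" denotes a standard deviation or a standard error, so the sketch simply treats it as a generic spread, and the `Score` class and `intervals_overlap` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """A reported total score of the form mean ± spread."""
    mean: float
    spread: float  # the "±" term; ASSUMED to be a generic spread (SD or SE not stated)

    def interval(self):
        # Range covered by mean ± spread.
        return (self.mean - self.spread, self.mean + self.spread)


def intervals_overlap(a: Score, b: Score) -> bool:
    """Return True if the mean ± spread ranges of a and b share any point."""
    a_lo, a_hi = a.interval()
    b_lo, b_hi = b.interval()
    return a_lo <= b_hi and b_lo <= a_hi


# Values copied from the abstract.
claude_standard = Score(mean=81.56, spread=1.92)
claude_extended = Score(mean=80.79, spread=1.83)
deepseek_r1 = Score(mean=86.03, spread=2.08)

# Gap between the two Claude-3.7-Sonnet modes, in score points.
gap = claude_standard.mean - claude_extended.mean

print(f"Claude standard minus extended: {gap:.2f} points")
print("Claude ranges overlap:", intervals_overlap(claude_standard, claude_extended))
print("DeepSeek-R1 vs Claude standard overlap:",
      intervals_overlap(deepseek_r1, claude_standard))
```

Under this reading, the two Claude-3.7-Sonnet modes differ by less than either reported spread, while DeepSeek-R1's range sits above that of Claude's standard mode. Overlapping ranges are not a significance test; a formal comparison would need the per-case scores and the evaluation design, neither of which is given in this section.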
