SemioLLM: Evaluating Large Language Models for Diagnostic Reasoning from Unstructured Clinical Narratives in Epilepsy

Meghal Dani; Muthu Jeyanthi Prakash; Zeynep Akata; Stefanie Liebe

TL;DR

SemioLLM demonstrates that large language models can derive probabilistic seizure onset zone localizations from unstructured seizure narratives, with prompt engineering and expert-like reasoning substantially boosting performance toward clinician-level accuracy. The study systematically compares six SOTA LLMs, assesses confidence and calibration, and analyzes reasoning quality and source attribution, revealing GPT-4's strengths in domain integration and citation accuracy while highlighting limitations in grounding and hallucinations. By introducing a scalable, domain-adaptable evaluation framework, SemioLLM provides practical guidelines for deploying AI in clinical settings where narrative descriptions drive diagnosis and can generalize to other medical domains reliant on unstructured text. The results emphasize the need for robust grounding, multilingual adaptation, and expert-aligned prompting to ensure safe, interpretable, and globally applicable AI in healthcare.

Abstract

Large Language Models (LLMs) have been shown to encode clinical knowledge. Many evaluations, however, rely on structured question-answer benchmarks, overlooking critical challenges of interpreting and reasoning about unstructured clinical narratives in real-world settings. Using free-text clinical descriptions, we present SemioLLM, an evaluation framework that benchmarks 6 state-of-the-art models (GPT-3.5, GPT-4, Mixtral-8x7B, Qwen-72B, LlaMa2, LlaMa3) on a core diagnostic task in epilepsy. Leveraging a database of 1,269 seizure descriptions, we show that most LLMs are able to accurately and confidently generate probabilistic predictions of seizure onset zones in the brain. Most models approach clinician-level performance after prompt engineering, with expert-guided chain-of-thought reasoning leading to the most consistent improvements. Performance was further strongly modulated by clinical in-context impersonation, narrative length and language context (13.7%, 32.7% and 14.2% performance variation, respectively). However, expert analysis of reasoning outputs revealed that correct prediction can be based on hallucinated knowledge and deficient source citation accuracy, underscoring the need to improve interpretability of LLMs in clinical use. Overall, SemioLLM provides a scalable, domain-adaptable framework for evaluating LLMs in clinical disciplines where unstructured verbal descriptions encode diagnostic information. By identifying both the strengths and limitations of state-of-the-art models, our work supports the development of clinically robust and globally applicable AI systems for healthcare.

Paper Structure (23 sections, 8 equations, 9 figures, 3 tables)

This paper contains 23 sections, 8 equations, 9 figures, 3 tables.

Introduction
Results
Prompt strategies significantly boost performance
High confidence does not guarantee correctness
Evaluating Clinical Reasoning and Source Attribution
Factors influencing LLM performance in seizure diagnostics
Symptom description length
Clinical in-context impersonation
Multilingual Performance
Discussion
Online Methods
Acknowledgments
Author Contribution
Methods
Dataset and curation
...and 8 more sections

Figures (9)

Figure 1: Overview of SemioLLM. We consider six SOTA models, including open-source and proprietary LLMs, and evaluate them across five standard prompt styles for the task of SOZ localization. Model outputs include likelihood estimates for seven major brain regions, reasoning, and source citations, and are evaluated for accuracy and confidence. The best-performing models are examined in more detail with respect to (a) task comprehension, logical reasoning, knowledge retrieval, clinical safety and source citation verification, (b) impact of symptom description length, (c) in-context clinical impersonation, and (d) multilingual alignment and understanding.
Figure 2: Performance comparison of SOTA LLMs and impact of prompt engineering strategies [Zero-Shot (ZS), Few-Shot (FS), ZS-Chain-of-Thought (CoT), FS-CoT and Self Consistency (SC)]. (a) Mean F1 scores for all models obtained by bootstrapping. The boxplot highlights a significant improvement with advanced prompt styles, showing performance comparable to clinicians and on par with a naive classifier (F1 score of 38.2%, red dashed line). (b) Confidence scores improve consistently with in-context learning, with FS and FS-CoT demonstrating the highest gains. Confidence score=1/0 for green/red, respectively. (c) Calibration (Brier Score Loss, BCL) for each model and prompt style, with FS-CoT and SC showing the best calibration (lowest BCL). (d) Multidimensional performance visualization comparing model correctness, confidence, and calibration metrics, with solid lines representing the best-performing models.
Figure 3: Evaluation of model reasoning. (a) Example query and corresponding annotations for a given semiology from GPT-4 and Mixtral-8x7B. (b) Correctness and completeness of model outputs. (c) Breakdown of model performance in reading comprehension, knowledge recall, and reasoning accuracy. (d) Comparison of average citation accuracy across models. Note that inter-rater reliability was high, with Cohen’s kappa scores of 0.88 for GPT-4 and 0.78 for Mixtral-8x7B, indicating strong agreement between evaluators.
Figure 4: Impact of description length, persona adaptation and language on model performance. (a) Performance for both models across various description-length bins and length-shuffled inputs. Mean F1 scores are shown, with error bars indicating $\pm$1 SEM, and each bin’s sample size ($N$) is indicated in parentheses. Note that each range $[x,y)$ includes $x$ but excludes $y$. (b) Influence of in-context persona adaptation on zero-shot performance, shown as changes in F1 score (red/green) and confidence (blue/dark blue) relative to the AI assistant persona. (c) Effect of language variation on performance. In the "Same English" condition, both the semiology description and prompt were in English. In the "Cross-Language" condition, only the semiology description was in a different language. In the "Same Language" condition, both the prompt and semiology description were in a non-English language.
Extended Data Figure 1: Data preprocessing pipeline. We use the Semio2Brain dataset, a collection of 2567 semiologies spread across 7 major brain regions. Preprocessing steps include replacing abbreviations with their full forms, correcting spelling errors, and removing uninformative words and semiology categories. This results in the 1269 rows used in our analysis.
...and 4 more figures
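
The Figure 2 caption refers to bootstrapped mean scores and to calibration measured by Brier score. As a rough illustration only, the sketch below shows one common way these two quantities could be computed for a single seizure description with likelihoods over seven regions. The function names, the seven-element vectors, and the per-region averaging are assumptions made for this example; the paper's exact definitions are not given on this page and may differ.

```python
import random
from statistics import mean


def brier_score(probs, labels):
    """Mean squared difference between predicted probabilities and 0/1 labels.

    Assumption: each region is treated as an independent binary outcome and
    the squared errors are averaged over regions. Lower is better calibrated.
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    return mean((p - y) ** 2 for p, y in zip(probs, labels))


def bootstrap_mean(scores, n_resamples=1000, seed=0):
    """Average of the means of resamples drawn with replacement.

    Assumption: a plain nonparametric bootstrap over per-sample scores
    (for example, per-description F1 values).
    """
    rng = random.Random(seed)
    n = len(scores)
    resample_means = [mean(rng.choices(scores, k=n)) for _ in range(n_resamples)]
    return mean(resample_means)


if __name__ == "__main__":
    # Hypothetical likelihoods for seven regions; only the first is a true onset zone.
    predicted = [0.70, 0.10, 0.10, 0.05, 0.05, 0.00, 0.00]
    truth = [1, 0, 0, 0, 0, 0, 0]
    print("Brier score:", brier_score(predicted, truth))

    # Hypothetical per-description scores, purely illustrative values.
    per_sample_scores = [0.2, 0.5, 0.4, 0.6, 0.3]
    print("Bootstrapped mean:", bootstrap_mean(per_sample_scores))
```

A perfectly confident and correct prediction gives a Brier score of 0, while an uninformative 0.5 likelihood for every region gives 0.25, which is why lower values in panel (c) indicate better calibration.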
