Beyond Negation Detection: Comprehensive Assertion Detection Models for Clinical NLP

Veysel Kocaman, Yigit Gul, M. Aytug Kaya, Hasham Ul Haq, Mehmet Butgul, Cabir Celik, David Talby

TL;DR

This work broadens assertion detection in clinical NLP beyond negation to a comprehensive, multi-category framework evaluated on real-world data. It compares diverse architectures (fine-tuned LLMs, Bi-LSTM-based deep learning (DL) models, BERT-based sequence classifiers, few-shot transformers, and a rule-based contextual module) within a Spark NLP Healthcare pipeline, benchmarking them against NegEx and commercial APIs. The fine-tuned LLM achieves the highest accuracy (0.962) but at substantial computational cost, while the domain-adapted DL and few-shot models deliver strong, cost-effective performance, and the Combined Pipeline offers a practical balance of accuracy and efficiency. Overall, the study shows that small, domain-specific models can outperform black-box APIs in both accuracy and scalability, supporting robust, production-ready assertion detection in healthcare NLP.

Abstract

Assertion status detection is a critical yet often overlooked component of clinical NLP, essential for accurately attributing extracted medical facts. Past studies have focused narrowly on negation detection, and commercial solutions such as AWS Medical Comprehend, Azure AI Text Analytics, and GPT-4o underperform on the broader task because of their limited domain adaptation. To address this gap, we developed state-of-the-art assertion detection models, including fine-tuned LLMs, transformer-based classifiers, few-shot classifiers, and deep learning (DL) approaches. We evaluated these models against cloud-based commercial API solutions, the legacy rule-based NegEx approach, and GPT-4o. Our fine-tuned LLM achieves the highest overall accuracy (0.962), outperforming GPT-4o (0.901) and the commercial APIs by a notable margin, with the largest gains in Present (+4.2%), Absent (+8.4%), and Hypothetical (+23.4%) assertions. Our DL-based models surpass commercial solutions in the Conditional (+5.3%) and Associated-with-Someone-Else (+10.1%) categories, while the few-shot classifier offers a lightweight yet highly competitive alternative (0.929) that is well suited to resource-constrained environments. Integrated within Spark NLP, our models consistently outperform black-box commercial solutions while enabling scalable inference and seamless integration with medical NER, Relation Extraction, and Terminology Resolution. These results reinforce the importance of domain-adapted, transparent, and customizable clinical NLP solutions over general-purpose LLMs and proprietary APIs.

Paper Structure

This paper contains 16 sections, 3 figures, 6 tables.

Figures (3)

  • Figure A1: The flow diagram of a Spark NLP pipeline. When fit() is called on the pipeline with a Spark data frame, the data frame's text column is fed into the DocumentAssembler() transformer, which creates a new document column as the entry point to Spark NLP. The document column is then passed through SentenceDetector(), Tokenizer(), and WordEmbeddings(), after which the data is ready to be fed into the NER models and then into the assertion model.
  • Figure A2: Example of a GPT-4o prompt for detecting assertion status in medical records
  • Figure A3: Example of a fine-tuned LLM prompt for detecting assertion status in medical records
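
The column flow described in the Figure A1 caption can be illustrated with a minimal, pure-Python sketch. This is not Spark NLP code and not the authors' models: every function below is a hypothetical stand-in named after the stage it imitates, the term lexicon and negation cues are invented for illustration, and the WordEmbeddings stage is omitted because the toy NER step is a plain lexicon lookup. The only point is that each stage reads columns produced earlier and adds one new column, ending with an assertion label per detected entity.

```python
import re
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]

# Hypothetical toy resources (assumptions, not taken from the paper).
TOY_TERMS = {"fever", "cough", "diabetes"}
NEGATION_CUES = {"no", "denies", "without"}


def document_assembler(row: Row) -> str:
    # Entry point: turn the raw text column into a "document" column.
    return row["text"].strip()


def sentence_detector(row: Row) -> List[str]:
    # Split the document on whitespace that follows ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", row["document"])
    return [p.strip() for p in parts if p.strip()]


def tokenizer(row: Row) -> List[List[str]]:
    # One lowercase token list per sentence.
    return [re.findall(r"\w+", s.lower()) for s in row["sentence"]]


def toy_ner(row: Row) -> List[Tuple[int, str]]:
    # Lexicon lookup standing in for a trained NER model.
    return [(i, tok) for i, toks in enumerate(row["token"])
            for tok in toks if tok in TOY_TERMS]


def toy_assertion(row: Row) -> List[Tuple[str, str]]:
    # Label an entity "absent" if a negation cue precedes it in its sentence.
    labels = []
    for sent_idx, term in row["ner_chunk"]:
        toks = row["token"][sent_idx]
        preceding = set(toks[:toks.index(term)])
        status = "absent" if preceding & NEGATION_CUES else "present"
        labels.append((term, status))
    return labels


# Ordered (output column, stage) pairs, mirroring the flow in the caption.
STAGES: List[Tuple[str, Callable[[Row], Any]]] = [
    ("document", document_assembler),
    ("sentence", sentence_detector),
    ("token", tokenizer),
    ("ner_chunk", toy_ner),
    ("assertion", toy_assertion),
]


def run_pipeline(text: str) -> Row:
    row: Row = {"text": text}
    for output_col, stage in STAGES:
        row[output_col] = stage(row)  # each stage adds exactly one column
    return row


if __name__ == "__main__":
    result = run_pipeline("Patient denies fever. She has a cough.")
    print(result["assertion"])
```

The toy assertion stage only distinguishes present from absent, which is roughly the negation-only view the paper argues is too narrow; categories such as Hypothetical, Conditional, or Associated-with-Someone-Else would need additional cues or a trained model.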