Causality extraction from medical text using Large Language Models (LLMs)

Seethalakshmi Gopalakrishnan; Luciana Garbayo; Wlodek Zadrozny

TL;DR

This work tackles the extraction of cause–effect relations from Clinical Practice Guidelines (CPGs), focusing on gestational diabetes. It compares BERT-based models (BioBERT, DistilBERT, BERT) with Large Language Models (GPT-4, LLAMA2) on a newly annotated corpus of CPGs. BioBERT achieves the highest average F1-score, 0.72, while GPT-4 and LLAMA2 offer complementary strengths but are limited in stability and coverage. The authors analyze annotation reliability (inter-annotator agreement) and release the dataset and code publicly, underscoring the practicality of fine-tuned BERT approaches for medical causality extraction. The findings suggest that, despite the rise of LLMs, domain-adapted, fine-tuned transformers currently deliver the most reliable performance for extracting causal statements from medical guidelines, with LLMs offering potential gains in broader or less structured contexts when additional data is available. This work lays the groundwork for benchmark datasets and reproducible evaluation in medical causality extraction, with implications for guideline comparison, clinical decision support, and patient care.

Abstract

This study explores the potential of natural language models, including large language models, to extract causal relations from medical texts, specifically from Clinical Practice Guidelines (CPGs). We present the outcomes of causality extraction from Clinical Practice Guidelines for gestational diabetes, a first in the field. We report on a set of experiments using variants of BERT (BioBERT, DistilBERT, and BERT) and Large Language Models (LLMs), namely GPT-4 and LLAMA2. Our experiments show that BioBERT performed better than the other models, including the Large Language Models, with an average F1-score of 0.72. GPT-4 and LLAMA2 show similar performance but less consistency. We also release the code and an annotated corpus of causal statements within the Clinical Practice Guidelines for gestational diabetes.

Paper Structure (15 sections, 2 figures, 7 tables)

Introduction
Related Work
Work related to automatic information extraction from Clinical Practice Guidelines (CPG)
Recent work on causality extraction from non-medical text
Causality extraction from the medical text
Data
Inter-annotator agreement for the medical data
Data preparation and preprocessing
Methodology
Causality extraction using BERT
Observations on using GPT-4 for causality extraction from medical guidelines
Results & Experiments
LLAMA2 for causality extraction from medical guidelines
Discussion
Conclusion

Figures (2)

Figure 1: Distribution of the labels in the corpus. Almost all labels each account for around 24% of the corpus.
Figure 2: Training and validation loss when fine-tuning BioBERT. As the number of epochs increases, the training loss decreases steadily and approaches 0. The validation loss decreases until epoch 16 and then starts to increase. Based on this, we fine-tuned BioBERT for 16 epochs.
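
The epoch choice described in the Figure 2 caption (stop where the validation loss reaches its minimum) can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name `best_epoch` and the loss values are hypothetical and invented for the example.

```python
from typing import Sequence


def best_epoch(val_losses: Sequence[float]) -> int:
    """Return the 1-based epoch with the lowest validation loss.

    Hypothetical helper for illustration; ties resolve to the earliest epoch.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    # Index of the smallest validation loss (first occurrence on ties).
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


# Invented loss values for illustration only (not taken from the paper).
example_val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.60]
print(best_epoch(example_val_losses))
```

Training past the selected epoch would keep lowering the training loss while the validation loss rises, which is the overfitting pattern the caption describes.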
