Clinical Trials Ontology Engineering with Large Language Models
Berkan Çakır
TL;DR
The paper tackles the challenge of converting rapidly growing clinical-trial results into usable knowledge by integrating LLMs with a dedicated ontology-merging pipeline. It presents a per-trial ontology extraction workflow using GPT3.5, GPT4, and Llama3 (8b/70b), followed by a novel merging strategy that achieves $O(n)$ merge time and $O(\log n)$ lookup via a sorted synonym list, enabling scalable, real-time data integration. Through practical evaluation and OQuaRE-based quality assessment, the study finds that chained prompting generally improves information extraction, with GPT4 approaching human-level performance in some settings, though issues like missing prefixes can reduce validity. The work demonstrates meaningful cost and time savings over manual curation and discusses extrapolated large-scale implications, arguing for LLM-assisted clinical-trial ontology engineering as a practical path toward real-time medical knowledge integration.
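The complexity claims for the sorted synonym list can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `(synonym, concept_id)` pair layout, the function names `lookup` and `merge`, and the rule that the first list wins on a duplicate synonym are all assumptions made for illustration. It shows one way a sorted list supports binary-search lookup and a linear two-pointer merge.

```python
from bisect import bisect_left

# Assumed layout (hypothetical): each ontology exposes a list of
# (synonym, concept_id) pairs, sorted by synonym, one entry per synonym.

def lookup(synonyms, term):
    """Binary search for `term`; returns its concept id or None."""
    # (term,) sorts before any (term, concept_id), so bisect_left lands
    # on the first entry whose synonym is >= term.
    i = bisect_left(synonyms, (term,))
    if i < len(synonyms) and synonyms[i][0] == term:
        return synonyms[i][1]
    return None

def merge(a, b):
    """Two-pointer merge of two sorted synonym lists in linear time.

    On a duplicate synonym the entry from `a` is kept (an assumption).
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            out.append(a[i])
            i += 1
        elif a[i][0] > b[j][0]:
            out.append(b[j])
            j += 1
        else:
            out.append(a[i])
            i += 1
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

# Made-up example data, not taken from the paper.
trial_a = [("aspirin", "C1"), ("ibuprofen", "C2")]
trial_b = [("acetylsalicylic acid", "C1"), ("ibuprofen", "C2"), ("paracetamol", "C3")]
merged = merge(trial_a, trial_b)
```

Because both inputs are already sorted, the merge visits each entry once and the result stays sorted, so later lookups can keep using binary search without a re-sort.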
Abstract
Managing clinical trial information is currently a significant challenge for the medical industry, as traditional methods are both time-consuming and costly. This paper proposes a simple yet effective methodology to extract and integrate clinical trial data in a cost-effective and time-efficient manner, allowing the medical industry to stay up to date with medical developments. It compares the time, cost, and quality of ontologies created by humans, GPT3.5, GPT4, and Llama3 (8b & 70b). The findings suggest that large language models (LLMs) are a viable option for automating this process from both a cost and a time perspective. The study has significant implications for medical research, where real-time data integration from clinical trials could become the norm.
