GAMedX: Generative AI-based Medical Entity Data Extractor Using Large Language Models

Mohammed-Khalil Ghali; Abdelrahman Farrag; Hajar Sakai; Hicham El Baz; Yu Jin; Sarah Lam

TL;DR

GAMedX addresses the challenge of extracting structured medical information from unstructured clinical narratives. It uses open-source LLMs within a prompt-engineered, unified NER framework that enforces structured outputs via Pydantic schemas. The approach combines in-context learning with Mistral 7B and Gemma 7B, preprocessing with LangChain, and strict post-processing. It achieves near-perfect ROUGE scores on a synthetic Data 4 Good dataset and substantially lower but improved performance on VAERS, complemented by a semantic analysis using t-SNE with dual embeddings. The results demonstrate the potential of cost-effective, LLM-based information extraction for automated form filling and EHR data processing, while highlighting dataset-specific challenges and terminology gaps. The work suggests that open-source LLMs can be practically deployed in healthcare settings. Future work focuses on expanding LLM options, covering other NLP tasks, and addressing real-world deployment considerations to enhance scalability and privacy compliance.
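To make the ROUGE-based evaluation mentioned above concrete, here is a minimal sketch of a ROUGE-1 F1 computation (unigram overlap between a reference and a model answer). It assumes lowercase whitespace tokenization, and the function name `rouge1_f1` is illustrative. The ROUGE variant, tokenizer, and library the authors actually used are not stated in this section.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate string.

    Assumption: lowercase whitespace tokenization, no stemming.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each token counts at most as often as it
    # appears in both the reference and the candidate.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

An extraction that matches the ground truth exactly scores 1.0, while one that drops a token from the reference loses recall and therefore F1.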

Abstract

In the rapidly evolving field of healthcare and beyond, the integration of generative AI in Electronic Health Records (EHRs) represents a pivotal advancement, addressing a critical gap in current information extraction techniques. This paper introduces GAMedX, a Named Entity Recognition (NER) approach that uses Large Language Models (LLMs) to efficiently extract entities from medical narratives and unstructured text generated throughout the various phases of a patient's hospital visit. To address the significant challenge of processing unstructured medical text, GAMedX leverages the capabilities of generative AI and LLMs for improved data extraction. The methodology takes a unified approach: it integrates open-source LLMs for NER and uses chained prompts and Pydantic schemas for structured output to navigate the complexities of specialized medical jargon. The findings show a high ROUGE F1 score on one of the evaluation datasets, with an accuracy of 98%. This innovation enhances entity extraction, offering a scalable, cost-effective solution for automated form filling from unstructured data. As a result, GAMedX streamlines the processing of unstructured narratives and sets a new standard in NER applications, contributing significantly to theoretical and practical advancements beyond the medical technology sphere.
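As an illustration of how a Pydantic schema can enforce structured output, the sketch below validates a JSON string (standing in for an LLM response) against a schema and rejects malformed answers. The class name `PatientRecord`, the helper `parse_llm_output`, and the field names are hypothetical. The fields loosely follow the entity types named in the Figure 1 caption and are not the authors' actual schema.

```python
import json
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class PatientRecord(BaseModel):
    """Hypothetical target schema; field names are assumptions."""

    patient_name: Optional[str] = None
    medications: List[str] = Field(default_factory=list)
    symptoms: List[str] = Field(default_factory=list)
    conditions: List[str] = Field(default_factory=list)
    precautions: List[str] = Field(default_factory=list)


def parse_llm_output(raw: str) -> Optional[PatientRecord]:
    """Return a validated record, or None if the output is unusable."""
    try:
        # json.loads rejects non-JSON text; the model constructor
        # rejects values whose types do not match the schema.
        return PatientRecord(**json.loads(raw))
    except (json.JSONDecodeError, ValidationError, TypeError):
        return None


# Example: a well-formed response passes, free text does not.
good = parse_llm_output('{"patient_name": "Jane Doe", "symptoms": ["headache"]}')
bad = parse_llm_output("The patient has a headache.")
```

Returning `None` instead of raising keeps the post-processing step simple: a caller can re-prompt the model or flag the record for review, which is one possible design among several.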

Paper Structure (21 sections, 6 equations, 8 figures, 3 tables)

Introduction
Literature review
Data
Dataset 1: Medical Transcripts (Data 4 Good Challenge)
Dataset 2: Vaccine Adverse Event Reporting System (VAERS)
Methodology
Loading and Preprocessing Data
Prompt Crafting & Pydantic Schema
Pre-trained Open-Source LLMs Used
Mistral 7B
Gemma 7B
In-Context Learning
Performance Evaluation
Results
Quantitative Analysis
...and 6 more sections

Figures (8)

Figure 1: Example of a patient-doctor dialogue with annotated data elements for NER, highlighting the extraction of patient names, medications, symptoms, conditions, and precautions.
Figure 2: Overview of LLM development methods – Pre-Training on diverse sources, Fine-Tuning, and Prompting.
Figure 3: Example of the prompt used
Figure 4: Benchmarks comparison of the LLMs used
Figure 5: t-SNE plots of ground truth and model answers for Mistral and Gemma using a one-shot prompt (panels a and b).
...and 3 more figures
