Enhancing LLMs for Identifying and Prioritizing Important Medical Jargons from Electronic Health Record Notes Utilizing Data Augmentation

Won Seok Jang, Sharmin Sultana, Zonghai Yao, Hieu Tran, Zhichao Yang, Sunjae Kwon, Hong Yu

TL;DR

This study tackles the challenge of helping patients understand EHR notes by identifying and ranking medically relevant jargon. It systematically evaluates closed- and open-source LLMs under prompting strategies, LoRA-based fine-tuning, and data augmentation using MIMIC-IV, with a 5-fold cross-validated dataset of 106 expert-annotated notes. Key contributions include a first comprehensive open- and closed-source LLM comparison for patient-centric jargon prioritization, and a novel data-augmentation approach that enables smaller open-source models to outperform larger proprietary models. Findings show fine-tuning and augmentation generally improve performance, with GPT-4 Turbo achieving the top F1 (0.433) and Mistral7B with augmentation achieving the top MRR (0.746); importantly, open-source models can surpass closed-source counterparts in many configurations, especially in low-resource settings. The work advances practical strategies for deploying patient-facing terminology extraction in real-world healthcare contexts and informs prompt design, data augmentation, and model fine-tuning choices in resource-constrained environments.

Abstract

OpenNotes enables patients to access EHR notes, but medical jargon can hinder comprehension. To improve understanding, we evaluated closed- and open-source LLMs for extracting and prioritizing key medical terms using prompting, fine-tuning, and data augmentation. We assessed LLMs on 106 expert-annotated EHR notes, experimenting with (i) general vs. structured prompts, (ii) zero-shot vs. few-shot prompting, (iii) fine-tuning, and (iv) data augmentation. To enhance open-source models in low-resource settings, we used ChatGPT for data augmentation and applied ranking techniques. We incrementally increased the augmented dataset size (10 to 10,000) and conducted 5-fold cross-validation, reporting F1 score and Mean Reciprocal Rank (MRR). Our results show that fine-tuning and data augmentation improved performance over other strategies. GPT-4 Turbo achieved the highest F1 (0.433), while Mistral7B with data augmentation had the highest MRR (0.746). Open-source models, when fine-tuned or augmented, outperformed closed-source models. Notably, the best F1 and MRR scores did not always align. Few-shot prompting outperformed zero-shot in vanilla models, and structured prompts yielded different preferences across models. Fine-tuning improved zero-shot performance but sometimes degraded few-shot performance. Data augmentation performed comparably to or better than other methods. Our evaluation highlights the effectiveness of prompting, fine-tuning, and data augmentation in improving model performance for medical jargon extraction in low-resource scenarios.
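The two reported metrics can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The function names, the case-insensitive exact-match rule, and the choice to score the reciprocal rank of the first predicted term found in the expert list are all assumptions made for illustration; the abstract does not specify how predicted terms are matched against annotations.

```python
from typing import Sequence


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace (assumed matching rule)."""
    return " ".join(term.lower().split())


def f1_score(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Set-based F1 between predicted and expert-annotated terms for one note."""
    pred = {normalize(t) for t in predicted}
    ref = {normalize(t) for t in gold}
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def reciprocal_rank(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """1 / rank of the first predicted term that appears in the gold list."""
    ref = {normalize(t) for t in gold}
    for rank, term in enumerate(predicted, start=1):
        if normalize(term) in ref:
            return 1.0 / rank
    return 0.0  # no predicted term matched any gold term


def mean_reciprocal_rank(
    all_predicted: Sequence[Sequence[str]],
    all_gold: Sequence[Sequence[str]],
) -> float:
    """Average the per-note reciprocal rank over a set of notes."""
    if not all_predicted:
        return 0.0
    total = sum(reciprocal_rank(p, g) for p, g in zip(all_predicted, all_gold))
    return total / len(all_predicted)
```

Under these assumptions, F1 rewards recovering the full set of annotated terms regardless of order, while MRR rewards placing a correct term near the top of the ranked output. This is one way the best F1 and the best MRR can come from different models.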

Paper Structure

This paper contains 14 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: The evaluation workflow for closed- and open-source LLMs. We evaluate the performance of the LLMs in three distinct settings. I. We assess the performance of closed- and open-source models by varying prompts and extraction tasks. II. Next, we fine-tune the open-source models using the same variations. III. Finally, we apply data augmentation to fine-tune the open-source LLMs and evaluate them under the same varying settings. Performance is measured using F1 and MRR scores through 5-fold cross-validation.
  • Figure 2: A sample EHR note where physicians identified important medical terms. Diagnoses/conditions are highlighted in yellow, while medications, tests, and procedures associated with those diagnoses are marked in green, accompanied by their respective rankings.
  • Figure 3: Case Study for Extracting the Top 3 Important Medical Jargons from Zero-shot and Few-shot Prompts in Mistral 7B. The few-shot prompting strategy demonstrates greater robustness in vanilla models compared to zero-shot prompting. The highlighted jargons represent terms that overlap with the expert-annotated labels, emphasizing the alignment between the model's outputs and domain experts' annotations.
  • Figure 4: Case Study for extracting the Top 5 important medical jargons from vanilla BioMistral7B and from BioMistral7B fine-tuned on a subset of the gold-labeled dataset, in zero-shot prompt settings. The fine-tuned model shows more robustness than the vanilla model, especially with zero-shot prompts. The highlighted jargons are the ones that overlap with the expert-annotated labels.
  • Figure 5: Case Study for extracting the Top 5 important medical jargons from fine-tuned Llama 3.1 8B and from Llama 3.1 8B fine-tuned on the MIMIC-IV augmented dataset, in zero-shot prompt settings. In many cases, augmented models show performance comparable to that of vanilla and fine-tuned models, especially with zero-shot prompts. The highlighted jargons are the ones that overlap with the expert-annotated labels.
  • ...and 3 more figures