Enhancing LLMs with Smart Preprocessing for EHR Analysis

Yixiang Qu, Yifan Dai, Shilin Yu, Pradham Tanikella, Travis Schrank, Trevor Hackman, Didong Li, Di Wu

TL;DR

This work tackles privacy and compute barriers to applying LLMs to EHR data by introducing a compact, locally deployable framework that uses regex-based preprocessing and Retrieval-Augmented Generation (RAG) to filter and highlight disease-relevant content in clinical notes. By preprocessing long, unstructured notes, the approach enables smaller LLMs to perform metastasis phenotyping with strong sensitivity while preserving high specificity. This is demonstrated on a private HNC dataset and on MIMIC-IV under zero-shot and few-shot settings, with additional comparisons against fine-tuned models on MIMIC-IV. Key findings show that regex preprocessing often yields higher gains than RAG for small models, reduces input length, and accelerates processing, making privacy-preserving local deployment feasible. The work provides practical guidance for real-world clinical NLP applications, emphasizing preprocessing as a critical component for balancing accuracy, privacy, and compute in resource-constrained healthcare environments.

Abstract

Large Language Models (LLMs) have demonstrated remarkable proficiency in natural language processing; however, their application in sensitive domains such as healthcare, especially in processing Electronic Health Records (EHRs), is constrained by limited computational resources and privacy concerns. This paper introduces a compact LLM framework optimized for local deployment in environments with stringent privacy requirements and restricted access to high-performance GPUs. Our approach leverages simple yet powerful preprocessing techniques, including regular expressions (regex) and Retrieval-Augmented Generation (RAG), to extract and highlight critical information from clinical notes. By pre-filtering long, unstructured text, we enhance the performance of smaller LLMs on EHR-related tasks. Our framework is evaluated using zero-shot and few-shot learning paradigms on both a private dataset and a publicly available dataset (MIMIC-IV), with additional comparisons against fine-tuned LLMs on MIMIC-IV. Experimental results demonstrate that our preprocessing strategy substantially improves the performance of smaller LLMs, making them well-suited for privacy-sensitive and resource-constrained applications. This study offers insights into optimizing LLM performance for local, secure, and efficient healthcare applications, and provides practical guidance for real-world deployment of LLMs while addressing challenges related to privacy, computational feasibility, and clinical applicability.
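
To make the pre-filtering idea concrete, the following is a minimal sketch of regex-based note filtering: it keeps only the sentences of a clinical note that match a disease-related pattern, optionally with neighboring sentences for context. The keyword pattern, the naive sentence splitter, and the names `METASTASIS_PATTERN`, `split_sentences`, and `filter_note` are all illustrative assumptions; the paper's actual expressions and pipeline are not given in this section.

```python
import re

# Hypothetical keyword pattern for metastasis mentions (an assumption,
# not the pattern used in the paper).
METASTASIS_PATTERN = re.compile(r"\b(metasta\w*|mets)\b", re.IGNORECASE)


def split_sentences(note: str) -> list[str]:
    """Naively split a note after ., ! or ? followed by whitespace, or on newlines."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", note)
    return [p.strip() for p in parts if p.strip()]


def filter_note(note: str, window: int = 0) -> str:
    """Keep sentences matching the pattern, plus `window` neighbors on each side."""
    sentences = split_sentences(note)
    keep: set[int] = set()
    for i, sentence in enumerate(sentences):
        if METASTASIS_PATTERN.search(sentence):
            lo = max(0, i - window)
            hi = min(len(sentences), i + window + 1)
            keep.update(range(lo, hi))
    # Preserve the original sentence order in the shortened note.
    return " ".join(sentences[j] for j in sorted(keep))
```

The shortened text returned by `filter_note` would then be placed in the LLM prompt in place of the full note. A note with no matching sentence yields an empty string, which a real pipeline would need to handle explicitly, for example by falling back to the unfiltered note.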

Paper Structure

This paper contains 29 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: A diagram contrasting existing methods with our method for handling the EHR dataset.
  • Figure 2: Flowchart of our proposed framework and evaluation process for two EHR datasets.
  • Figure 3: Comparison of classification results for the Private HNC dataset and MIMIC-IV dataset under different preprocessing conditions and LLM settings. (a) A bar chart displaying classification results for the Private HNC dataset under different preprocessing conditions using Gemma-7B-it. The x-axis represents processing methods (Regex and No Preprocessing), while the y-axis represents the proportion of correctly classified instances. The chart includes three shot settings (Zero-shot, Three-shot, Six-shot) across different time ranges (20 days, 30 days, and 40 days). (b) A bar chart illustrating LLM classification results for the MIMIC-IV dataset under different preprocessing conditions. The x-axis represents processing methods (Regex and No Preprocessing), and the y-axis represents the proportion of correctly classified instances. The chart compares classification accuracy across various LLM configurations, including Zero-shot, Three-shot, and Six-shot learning using Gemma-7B-it, as well as two classification models fine-tuned on Gemma-2B-it and Gemma-7B-it. Each column corresponds to a specific method, while rows distinguish between Patient and Hospital Admission categories.
  • Figure 4: Comparison of LLMs and preprocessing approaches under different shot settings based on the MIMIC-IV dataset. (a) A bar chart showing the comparison of three different language models (Gemma-7B-it, LLaMA-2-7B-Chat-Med, and Bio-Medical-LLaMA-3-8B) across three shot settings: Zero-shot, Three-shot, and Six-shot. The x-axis represents preprocessing methods, including Regex and No Preprocessing. The y-axis represents Sensitivity. Each row represents a different language model, while each facet represents a different shot setting. Bars are color-coded for different preprocessing methods, and error bars indicate the range of the results. (b) A bar chart comparing different preprocessing approaches under three kinds of shot settings (Zero-shot, Three-shot, Six-shot) for Gemma-7B-it. The x-axis represents preprocessing methods, including Regex, RAG (Top 2), RAG (Top 3), RAG (Top 6), and No Preprocessing. The y-axis represents Sensitivity. Each shot setting is displayed as a separate facet, and error bars indicate the range of the results.
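
Figure 4 reports sensitivity, and the TL;DR also refers to specificity. As a reminder of how these two quantities are defined for a binary phenotyping task, here is a small sketch that computes them from true and predicted labels. The function name `sensitivity_specificity` and the 0/1 label encoding (1 for metastasis present) are assumptions made for illustration, not details taken from the paper.

```python
import math


def sensitivity_specificity(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    # Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    # Return NaN when a class is absent, so the ratio is undefined.
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    specificity = tn / (tn + fp) if (tn + fp) else math.nan
    return sensitivity, specificity
```

For example, with true labels `[1, 1, 0, 0, 1]` and predictions `[1, 0, 0, 1, 1]`, two of the three positives are detected and one of the two negatives is correctly rejected, giving a sensitivity of 2/3 and a specificity of 1/2.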