Enhancing LLMs with Smart Preprocessing for EHR Analysis
Yixiang Qu, Yifan Dai, Shilin Yu, Pradham Tanikella, Travis Schrank, Trevor Hackman, Didong Li, Di Wu
TL;DR
This work tackles privacy and compute barriers to applying LLMs to EHR data by introducing a compact, locally deployable framework that uses regex-based preprocessing and Retrieval-Augmented Generation to filter and highlight disease-relevant content in clinical notes. By preprocessing long, unstructured notes, the approach enables smaller LLMs to perform metastasis phenotyping with strong sensitivity while preserving high specificity, demonstrated on a private HNC dataset and on MIMIC-IV in zero-shot, few-shot, and fine-tuning settings. Key findings show that regex preprocessing often yields larger gains than RAG for small models, reduces input length, and accelerates processing, making privacy-preserving local deployment feasible. The work provides practical guidance for real-world clinical NLP applications, emphasizing preprocessing as a critical component for balancing accuracy, privacy, and compute in resource-constrained healthcare environments.
Abstract
Large Language Models (LLMs) have demonstrated remarkable proficiency in natural language processing; however, their application in sensitive domains such as healthcare, especially in processing Electronic Health Records (EHRs), is constrained by limited computational resources and privacy concerns. This paper introduces a compact LLM framework optimized for local deployment in environments with stringent privacy requirements and restricted access to high-performance GPUs. Our approach leverages simple yet powerful preprocessing techniques, including regular expressions (regex) and Retrieval-Augmented Generation (RAG), to extract and highlight critical information from clinical notes. By pre-filtering long, unstructured text, we enhance the performance of smaller LLMs on EHR-related tasks. Our framework is evaluated using zero-shot and few-shot learning paradigms on both a private dataset and a publicly available dataset (MIMIC-IV), with additional comparisons against fine-tuned LLMs on MIMIC-IV. Experimental results demonstrate that our preprocessing strategy significantly improves the performance of smaller LLMs, making them well-suited for privacy-sensitive and resource-constrained applications. This study offers insights into optimizing LLM performance for local, secure, and efficient healthcare applications, and provides practical guidance for real-world deployment of LLMs while addressing challenges related to privacy, computational feasibility, and clinical applicability.
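To make the regex pre-filtering idea concrete, the following is a minimal Python sketch, not the authors' implementation. It splits a note into sentences, keeps only those matching a keyword pattern (optionally with neighboring sentences for context), and marks the matched terms so a small LLM sees a shorter, highlighted input. The keyword pattern, the `**` highlight markers, the naive sentence splitter, and the names `filter_and_highlight` and `window` are all assumptions introduced for illustration; the example note is invented.

```python
import re

# Hypothetical keyword pattern for metastasis-related terms (an assumption;
# the patterns actually used in the paper are not given in this section).
KEYWORDS = re.compile(r"\b(metasta\w*|mets)\b", re.IGNORECASE)

# Naive sentence splitter: break on whitespace that follows '.', '!' or '?'.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def filter_and_highlight(note: str, window: int = 0) -> str:
    """Keep sentences that mention a keyword, plus `window` neighbors on each
    side, and wrap every keyword match in ** markers."""
    sentences = [s.strip() for s in SENTENCE_SPLIT.split(note) if s.strip()]
    keep = set()
    for i, sentence in enumerate(sentences):
        if KEYWORDS.search(sentence):
            lo = max(0, i - window)
            hi = min(len(sentences), i + window + 1)
            keep.update(range(lo, hi))
    highlighted = [
        KEYWORDS.sub(lambda m: f"**{m.group(0)}**", sentences[i])
        for i in sorted(keep)
    ]
    return " ".join(highlighted)


# Invented example note, for illustration only.
note = (
    "Patient seen for follow-up. "
    "CT shows no evidence of metastasis in the lungs. "
    "Diet is regular. Prior biopsy was negative."
)
filtered = filter_and_highlight(note)
print(filtered)
```

The filtered text would then replace the full note in the LLM prompt. Setting `window=1` keeps one sentence of context on each side of a match, trading a longer input for more surrounding information.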
