Evaluating LLM Abilities to Understand Tabular Electronic Health Records: A Comprehensive Study of Patient Data Extraction and Retrieval
Jesus Lovon, Martin Mouysset, Jo Oleiwan, Jose G. Moreno, Christine Damase-Michel, Lynda Tamine
TL;DR
This study investigates how prompting strategies and EHR representations influence large language models' ability to extract and retrieve patient data from tabular electronic health records. By evaluating two backbone LLMs (Llama2 and Meditron) on MIMICSQL-based tasks and introducing two new datasets (MIMIC$_{ask}$ and MIMIC$_{search}$), the work shows that optimal feature selection and self-generated EHR descriptions can boost performance substantially (up to 26.79% over naive approaches), while in-context learning yields modest gains for extraction (5.95%) and is less beneficial for retrieval. The authors derive practical guidelines for prompting LLMs in health search, highlighting that retrieval remains more challenging than extraction and that model expertise and serialization choices strongly affect outcomes. Overall, the paper provides an empirically grounded framework for designing LLM-based health data tools, with implications for health search applications and future research in EHR data-to-text systems.
Abstract
Electronic Health Record (EHR) tables pose unique challenges, including hidden contextual dependencies between medical features and a high level of data dimensionality and sparsity. This study presents the first investigation into the abilities of LLMs to comprehend EHRs for patient data extraction and retrieval. We conduct extensive experiments using the MIMICSQL dataset to explore the impact of the prompt structure, instruction, context, and demonstration on the task performance of two backbone LLMs, Llama2 and Meditron. Through quantitative and qualitative analyses, our findings show that optimal feature selection and serialization methods can enhance task performance by up to 26.79% compared to naive approaches. Similarly, in-context learning setups with relevant example selection improve data extraction performance by 5.95%. Based on our study findings, we propose guidelines that we believe would help in designing LLM-based models to support health search.
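
To make the notions of feature selection, serialization, and prompt structure (instruction, context, demonstration) more concrete, the following is a minimal illustrative sketch, not the authors' implementation. The column names, the `name: value` template, and the helper names `serialize_record` and `build_prompt` are all assumptions introduced here for illustration.

```python
from typing import Mapping, Sequence, Tuple


def serialize_record(record: Mapping[str, object], features: Sequence[str]) -> str:
    """Turn the selected, non-empty columns of one EHR row into 'name: value' text."""
    parts = []
    for name in features:
        value = record.get(name)
        # Skip missing or empty cells, which are common in sparse EHR tables.
        if value is None or value == "":
            continue
        parts.append(f"{name}: {value}")
    return "; ".join(parts)


def build_prompt(
    instruction: str,
    record: Mapping[str, object],
    features: Sequence[str],
    question: str,
    demonstrations: Sequence[Tuple[str, str, str]] = (),
) -> str:
    """Assemble instruction, optional demonstrations, serialized context, and question."""
    blocks = [instruction]
    # Each demonstration is a (serialized record, question, answer) triple.
    for demo_context, demo_question, demo_answer in demonstrations:
        blocks.append(
            f"Patient record: {demo_context}\n"
            f"Question: {demo_question}\n"
            f"Answer: {demo_answer}"
        )
    context = serialize_record(record, features)
    blocks.append(f"Patient record: {context}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical row; the column names are illustrative and not taken from the paper.
row = {"subject_id": 101, "age": 64, "diagnosis": "sepsis", "drug": None, "lab_value": ""}
selected = ["age", "diagnosis", "drug"]

prompt = build_prompt(
    instruction="Answer the question using only the patient record.",
    record=row,
    features=selected,
    question="What is the patient's diagnosis?",
)
print(prompt)
```

In this sketch, restricting `features` to a subset of columns stands in for feature selection, and passing a non-empty `demonstrations` sequence corresponds to an in-context learning setup. How features and demonstrations are actually chosen is left open here.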
