Evaluating LLM Abilities to Understand Tabular Electronic Health Records: A Comprehensive Study of Patient Data Extraction and Retrieval

Jesus Lovon, Martin Mouysset, Jo Oleiwan, Jose G. Moreno, Christine Damase-Michel, Lynda Tamine

TL;DR

This study investigates how prompting strategies and EHR representations influence large language models' ability to extract and retrieve patient data from tabular electronic health records. By evaluating two backbone LLMs (Llama2 and Meditron) on MIMICSQL-based tasks and introducing two new datasets (MIMIC$_{ask}$ and MIMIC$_{search}$), the work reveals that optimal feature selection and serialization choices, such as self-generated EHR descriptions, can boost performance by up to 26.79% over naive approaches, while in-context learning yields a modest gain for extraction (5.95% with relevant example selection) and is less beneficial for retrieval. The authors derive practical guidelines for prompting LLMs in health search, highlighting that retrieval remains more challenging than extraction and that model expertise and serialization choices strongly affect outcomes. Overall, the paper provides a detailed, empirically grounded framework for designing LLM-based health data tools, with implications for health search applications and future research in EHR data-to-text systems.

Abstract

Electronic Health Record (EHR) tables pose unique challenges, among them the presence of hidden contextual dependencies between medical features and a high level of data dimensionality and sparsity. This study presents the first investigation into the abilities of LLMs to comprehend EHRs for patient data extraction and retrieval. We conduct extensive experiments using the MIMICSQL dataset to explore how prompt structure, instruction, context, and demonstration affect the task performance of two backbone LLMs, Llama2 and Meditron. Through quantitative and qualitative analyses, our findings show that optimal feature selection and serialization methods can enhance task performance by up to 26.79% compared to naive approaches. Similarly, in-context learning setups with relevant example selection improve data extraction performance by 5.95%. Based on our study findings, we propose guidelines that we believe would help the design of LLM-based models to support health search.
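
To make "feature selection and serialization" concrete, here is a minimal Python sketch of turning one tabular row into a short text description. It is not the authors' implementation: the function name `serialize_record`, the column names, and the "feature is value" template are all hypothetical and chosen only for illustration.

```python
from typing import Mapping, Sequence


def serialize_record(record: Mapping[str, object], features: Sequence[str]) -> str:
    """Render selected, non-empty cells of one row as 'feature is value' sentences.

    Hypothetical helper for illustration; the template is an assumption,
    not the serialization used in the paper.
    """
    parts = []
    for name in features:
        value = record.get(name)
        if value is None or value == "":
            continue  # skip missing cells, which are common in sparse EHR tables
        parts.append(f"{name} is {value}")
    if not parts:
        return ""
    return ". ".join(parts) + "."


# Illustrative row; the column names are made up, not taken from MIMICSQL.
row = {"age": 63, "gender": "F", "diagnosis": "sepsis", "lab_value": None}

# Feature selection: keep only the columns assumed relevant to the question.
text = serialize_record(row, ["age", "diagnosis", "lab_value"])
print(text)
```

In this sketch, feature selection is simply the list of column names passed in, and serialization is the text template. Both are the kind of choice the study reports as affecting task performance.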

Paper Structure (23 sections, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Illustration of the prompts used for the extraction and retrieval tasks, including Guided vs. Non-Guided instructions, and patient with txt (left) serializations.
  • Figure 2: (top) Example of random and query-based demonstrations in an ICL setup for extraction. (bottom) Example of ICL and zero-shot setups for retrieval. The features (and values) referenced in the demonstration and input are highlighted.