Prompting Large Language Models for Zero-Shot Clinical Prediction with Structured Longitudinal Electronic Health Record Data

Yinghao Zhu, Zixiang Wang, Junyi Gao, Yuning Tong, Jingkun An, Weibin Liao, Ewen M. Harrison, Liantao Ma, Chengwei Pan

TL;DR

This study investigates zero-shot clinical prediction by grounding structured longitudinal EHR data for large language models through a five-element prompting framework (role, instruction, clinical context, input data, output indicator). By systematically addressing data-level, task-level, and model-level questions, the authors show that incorporating longitudinal representations, sparsity handling, and knowledge-infused context (units and reference ranges) substantially improves predictive performance for mortality, length-of-stay, and readmission on TJH and MIMIC-IV, with GPT-4 achieving notable gains over traditional ML/DL baselines in few-shot or zero-shot settings. Across three research threads, the work demonstrates that feature-wise input formats and context-rich prompts enhance accuracy and reduce decoding failures, while temporal sensitivity appears limited for some ICU predictions. The findings highlight the potential of LLMs to support rapid, data-scarce clinical decision-making during emerging diseases, and provide a reproducible prompting framework and resources for future research in healthcare AI.

Abstract

The inherent complexity of structured longitudinal Electronic Health Records (EHR) data poses a significant challenge when integrated with Large Language Models (LLMs), which are traditionally tailored for natural language processing. Motivated by the urgent need for swift decision-making during new disease outbreaks, where traditional predictive models often fail due to a lack of historical data, this research investigates the adaptability of LLMs such as GPT-4 to EHR data. We particularly focus on their zero-shot capabilities, which enable them to make predictions in scenarios for which they have not been explicitly trained. In response to the longitudinal, sparse, and knowledge-infused nature of EHR data, our prompting approach takes into account specific EHR characteristics such as units and reference ranges, and employs an in-context learning strategy that aligns with clinical contexts. Our comprehensive experiments on the MIMIC-IV and TJH datasets demonstrate that, with our carefully designed prompting framework, LLMs can improve prediction performance in key tasks such as mortality, length-of-stay, and 30-day readmission by about 35%, surpassing ML models in few-shot settings. Our research underscores the potential of LLMs in enhancing clinical decision-making, especially in urgent healthcare situations like the outbreak of emerging diseases with no labeled data. The code is publicly available at https://github.com/yhzhu99/llm4healthcare for reproducibility.
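
As a rough illustration of the five-element prompt described above (role, instruction, clinical context, input data, output indicator), the following Python sketch assembles such a prompt from a few features. It is not taken from the authors' repository: the `Feature` schema, the `build_prompt` function, the wording of each element, and the use of "NaN" for missing values are all assumptions made for this example, and the patient values are toy numbers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Feature:
    """One feature plus its knowledge-infused context (hypothetical schema)."""
    name: str
    unit: str
    ref_low: float
    ref_high: float
    values: List[Optional[float]]  # one entry per visit; None marks a missing value


def build_prompt(features: List[Feature], task: str) -> str:
    """Join the five prompt elements into one string (illustrative wording only)."""
    # 1. Role
    role = "You are an experienced clinician."
    # 2. Instruction
    instruction = f"Based on the patient's records below, predict the {task}."
    # 3. Clinical context: units and reference ranges for each feature
    context = "Clinical context:\n" + "\n".join(
        f"- {f.name}: unit {f.unit}, reference range {f.ref_low} to {f.ref_high}"
        for f in features
    )
    # 4. Input data: one line per feature, values ordered by visit;
    #    missing values are written as "NaN" (an assumption of this sketch)
    data = "Input data:\n" + "\n".join(
        f"- {f.name}: " + ", ".join("NaN" if v is None else str(v) for v in f.values)
        for f in features
    )
    # 5. Output indicator
    output_indicator = "Answer with a single number between 0 and 1."
    return "\n\n".join([role, instruction, context, data, output_indicator])


# Toy example, not real patient data.
toy_features = [Feature("Creatinine", "mg/dL", 0.6, 1.3, [1.1, None, 2.4])]
toy_prompt = build_prompt(toy_features, "in-hospital mortality risk")
print(toy_prompt)
```

Keeping the units and reference ranges in a separate context block, rather than repeating them next to every value, keeps the input-data lines short; whether that matches the layout the authors actually used is not stated in this summary.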

Paper Structure (38 sections, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Proposed prompt template, which incorporates five key elements of prompt engineering (role, instruction, clinical context, input data, and output indicator), and the overall structure of the paper.
  • Figure 2: An example of an LLM's analysis with two different input formats representing longitudinality (feature-wise and visit-wise).
  • Figure 3: An example of an LLM's analysis of a patient's health condition with different contexts. Red stands for incorrect analysis from the LLM. Green stands for reasonable analysis from the LLM. Blue stands for units of features. Orange stands for reference ranges of features.
  • Figure 4: Prompt of units and reference ranges of sampled features.
  • Figure 5: Density of prediction logits on TJH and MIMIC-IV datasets across three time spans: upon discharge, 1 month and 6 months post-discharge.
  • ...and 1 more figure
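
The feature-wise and visit-wise input formats named in the Figure 2 caption can be sketched as two ways of serializing the same visit records. This is a minimal illustration under stated assumptions: the function names, the line layout, and the "NaN" placeholder for missing values are hypothetical, and the numbers are toy values rather than data from TJH or MIMIC-IV.

```python
from typing import Dict, List, Optional

# Each visit maps a feature name to its value; None marks a missing measurement.
Visit = Dict[str, Optional[float]]


def fmt(value: Optional[float]) -> str:
    """Render a value as text, writing missing values as 'NaN' (an assumption)."""
    return "NaN" if value is None else str(value)


def feature_wise(visits: List[Visit], names: List[str]) -> str:
    """One line per feature, listing that feature's values across all visits."""
    return "\n".join(
        f"- {name}: " + ", ".join(fmt(v.get(name)) for v in visits)
        for name in names
    )


def visit_wise(visits: List[Visit], names: List[str]) -> str:
    """One line per visit, listing every feature recorded at that visit."""
    return "\n".join(
        f"- Visit {i}: " + ", ".join(f"{name} {fmt(v.get(name))}" for name in names)
        for i, v in enumerate(visits, start=1)
    )


# Toy records, not real patient data.
toy_visits = [
    {"Heart rate": 88.0, "Lactate": None},
    {"Heart rate": 95.0, "Lactate": 2.1},
]
toy_names = ["Heart rate", "Lactate"]

print(feature_wise(toy_visits, toy_names))
print(visit_wise(toy_visits, toy_names))
```

The feature-wise layout places each feature's trajectory on a single line, which is one plausible reason it could be easier for a model to read trends from; the summary above reports that feature-wise formats improved accuracy but does not state the mechanism.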