Context Clues: Evaluating Long Context Models for Clinical Prediction Tasks on EHRs

Michael Wornow, Suhana Bedi, Miguel Angel Fuentes Hernandez, Ethan Steinberg, Jason Alan Fries, Christopher Re, Sanmi Koyejo, Nigam H. Shah

TL;DR

This study systematically evaluates how context length affects clinical prediction using longitudinal, structured EHR data across four architectures (GPT, Llama, Mamba, Hyena). By pretraining on $2.5$ million patients and evaluating on the EHRSHOT benchmark with context lengths up to $L=16k$, the authors show that long-context models, especially Mamba, can achieve state-of-the-art performance on a majority of tasks and are more robust to EHR-specific properties like copy-forwarding, irregular inter-event intervals, and disease progression. The work also introduces quantitative metrics for EHR-specific challenges, analyzes perplexity dynamics over time, and provides a data-and-code release to support reproducibility and further research in long-context modeling for healthcare. Overall, the findings highlight the practical potential and limitations of long-context FMs in modeling lifetime patient trajectories, informing architecture choice and future directions for real-world deployment in hospitals.

Abstract

Foundation Models (FMs) trained on Electronic Health Records (EHRs) have achieved state-of-the-art results on numerous clinical prediction tasks. However, most existing EHR FMs have context windows of <1k tokens. This prevents them from modeling full patient EHRs, which can contain tens of thousands of events. Recent advancements in subquadratic long-context architectures (e.g., Mamba) offer a promising solution. However, their application to EHR data has not been well-studied. We address this gap by presenting the first systematic evaluation of the effect of context length on modeling EHR data. We find that longer context models improve predictive performance -- our Mamba-based model surpasses the prior state-of-the-art on 9/14 tasks on the EHRSHOT prediction benchmark. For clinical applications, however, model performance alone is insufficient -- robustness to the unique properties of EHR data is crucial. Thus, we also evaluate models across three previously underexplored properties of EHR data: (1) the prevalence of "copy-forwarded" diagnoses, which creates artificial repetition of tokens within EHR sequences; (2) the irregular time intervals between EHR events, which can lead to a wide range of timespans within a context window; and (3) the natural increase in disease complexity over time, which makes later tokens in the EHR harder to predict than earlier ones. Stratifying our EHRSHOT results, we find that higher levels of each property correlate negatively with model performance, but that longer context models are more robust to more extreme levels of these properties. Our work highlights the potential for using long-context architectures to model EHR data, and offers a case study for identifying new challenges in modeling sequential data motivated by domains outside of natural language. We release our models and code at: https://github.com/som-shahlab/long_context_clues

Paper Structure

This paper contains 47 sections, 25 equations, 12 figures, 15 tables.

Figures (12)

  • Figure 1: The central claims of this paper. (a) EHRs are sequences: An EHR is simply a timeline of clinical events that occur to a patient, and thus can be naturally represented as a sequence of tokens. (b) Long context improves performance: AUROC on clinical prediction tasks tends to increase with longer context lengths, with Hyena (red) being the notable exception. Overall, Mamba (green) at a context length of 16k achieves the highest average AUROC across 14 diverse clinical prediction tasks. (c) EHR data has distinct properties: In contrast to natural language, EHR data has unique properties whose implications remain under-explored in the ML literature. Here, we highlight three such attributes -- copy-forwarding, irregular time intervals between tokens, and disease progression. (d) EHR properties present unique modeling challenges: Stratifying patients by the degree to which they exhibit each EHR-specific property, we find that higher Brier scores (i.e., worse model performance) are associated with patients who have more repetitive (top) or irregular (middle) EHRs. Additionally, the perplexity of tokens later in a patient's timeline tends to be higher, even when conditioning on prior tokens (bottom).
  • Figure 2: EHR data exhibits a high degree of variation in time intervals between events. From left to right, we measure the mean, standard deviation, and inter-quartile range (IQR) of time intervals between events, reflecting the irregular timing of clinical interactions. "EHR-OMOP" (blue) denotes the 0.5M patients in the EHR-OMOP validation set. The x-axis (log scale) represents the metric in seconds, ranging from $10^1$ to $10^9$. The y-axis measures the number of sequences with those values. Here, we focus on event intervals to capture the temporal structure of clinical encounters and highlight patterns in patient healthcare utilization.
  • Figure 3: EHR data exhibits a higher degree of repetition than natural language, as measured by $n$-gram repetition rates. From left to right, we measure $n = 1, 2, 3, 4$. "EHR-OMOP" (blue) denotes the 0.5M patients in the EHR-OMOP validation dataset, and "WikiText" (orange) denotes the WikiText-103 training dataset of high-quality Wikipedia articles (Merity et al., 2016). We analyze $n$-gram repetition at the event level to reflect the structure of recurring clinical events, capturing patterns unique to EHR data. The x-axis represents the $n$-gram repetition rate (i.e., percentage of $n$-grams that are repeated at least once within a sequence, where higher is more repetitive) and the y-axis shows the frequency of sequences with that repetition rate in each dataset.
  • Figure 4: Median perplexity (PPL) by token position for different models -- GPT (far left), Hyena (middle left), Llama (middle right), Mamba (far right) -- across varying context lengths (lines). The x-axis represents token position, and the y-axis shows the median PPL at each position measured across 20k EHR-OMOP patients. We analyze PPL by token rather than by event to capture the model's handling of the specific information content in each encoded token. Note that the upward trend in PPL is almost immediate, even within the first hundred tokens of each model's context window.
  • Figure 5: Distributions of patient data from the EHR-OMOP dataset across (A) training and (B) validation splits, showing both event-level and code-level counts. The x-axis is log-scaled to capture the wide range in the number of events per patient, the number of unique patients per code, and the distribution of events associated with each code.
  • ...and 7 more figures
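
The per-sequence statistics described in the captions for Figures 2 and 3 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `ngram_repetition_rate` and `interval_stats` are hypothetical, and the sketch assumes one particular reading of "repetition rate", namely the fraction of $n$-gram occurrences in a sequence whose $n$-gram appears more than once in that same sequence. It also assumes event timestamps are given in seconds and sorted in ascending order, and it uses the population standard deviation; the paper may define these details differently.

```python
from collections import Counter
import statistics


def ngram_repetition_rate(events, n):
    """Fraction of n-gram occurrences whose n-gram appears more than once.

    `events` is one patient's ordered list of event codes (hypothetical input
    format). This is one plausible reading of the caption's definition, not
    necessarily the one used in the paper.
    """
    ngrams = [tuple(events[i:i + n]) for i in range(len(events) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Count every occurrence of an n-gram that shows up at least twice.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def interval_stats(timestamps):
    """Mean, standard deviation, and IQR of gaps between consecutive events.

    `timestamps` are assumed to be in seconds and sorted ascending.
    """
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three events to compute spread statistics")
    q1, _median, q3 = statistics.quantiles(gaps, n=4)
    return {
        "mean": statistics.fmean(gaps),
        "std": statistics.pstdev(gaps),  # population standard deviation (assumption)
        "iqr": q3 - q1,
    }


if __name__ == "__main__":
    # Toy sequence: a diagnosis/lab pair that is copy-forwarded once.
    toy_events = ["dx_A", "lab_B", "dx_A", "lab_B", "rx_C"]
    for n in (1, 2, 3):
        print(n, ngram_repetition_rate(toy_events, n))

    # Two visits a minute apart, then a gap of one hour.
    print(interval_stats([0, 60, 120, 3720]))
```

Computing each statistic per patient and then plotting a histogram over patients would give distributions of the kind the captions describe; stratifying patients by these values is one way to form the groups compared in Figure 1(d).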