Foundation models for electronic health records: representation dynamics and transferability

Michael C. Burkhart, Bashar Ramadan, Zewei Liao, Kaveri Chhikara, Juan C. Rojas, William F. Parker, Brett K. Beaulieu-Jones

TL;DR

This study investigates the transferability of foundation models trained on MIMIC-IV EHR data to a different health system (UCMC) by examining representation dynamics, outlier detection, and outcome-specific fine-tuning. The authors train a 1B-parameter FM with a self-supervised next-token objective, extract 24-hour representations, and evaluate both representation-based classifiers and supervised fine-tuning for four clinically relevant outcomes. They find substantial cross-site degradation without adaptation, but show that fine-tuning, especially with some local target-domain data, markedly improves performance, particularly for ICU admission and IMV prediction. Across datasets, representation trajectory features correlate with adverse outcomes, suggesting that analyzing clinical latent-space dynamics can inform early risk stratification and model deployment in diverse healthcare settings.

Abstract

Foundation models (FMs) trained on electronic health records (EHRs) have shown strong performance on a range of clinical prediction tasks. However, adapting these models to local health systems remains challenging due to limited data availability and resource constraints. In this study, we investigated what these models learn and evaluated the transferability of an FM trained on MIMIC-IV to an institutional EHR dataset at the University of Chicago Medical Center. We assessed their ability to identify outlier patients and examined representation-space patient trajectories in relation to future clinical outcomes. We also evaluated the performance of supervised fine-tuned classifiers on both source and target datasets. Our findings offer insights into the adaptability of FMs across different healthcare systems, highlight considerations for their effective implementation, and provide an empirical analysis of the underlying factors that contribute to their predictive performance.

Paper Structure

This paper contains 23 sections, 5 figures, 13 tables.

Figures (5)

  • Figure 1: In (a), our initial training process packed sequences together, allowing one sequence to bleed into the next example within a batch. The dark goldenrod boundary outlines tokens corresponding to two individual hospitalization events. We insert a variable number of padding tokens between sequences to expose the model to padding. For the initial training, the model attempted to predict the next token in a sequence given the previous tokens ('context'). In (b), we performed supervised fine-tuning with left-padded sequences. Each hospitalization event (truncated at 24 hours) occupies a single training instance and is paired with its associated subsequent outcome. In this way, fine-tuning is outcome-specific.
  • Figure 2: The tokenization process converts information and events associated with a hospitalization into a sequence of integers. (a) Category-value tokenization iterates over all categories present in a table and learns deciles for the values within each category. In this example, we see how the vital corresponding to temperature in Celsius is assigned the label '33'. All measurements of temperature in the MIMIC training set are used to determine deciles for measurements within this category. For hospitalization 42, the tokens '33' for this category and then '0' for the corresponding deciled measurement would be inserted into the timeline at 'E1'. In (b), we see the anatomy of a basic timeline: a start token, some information about the patient and the admission, a series of measurements inserted in chronological order describing the visit, a discharge token, and a token for timeline end.
  • Figure 3: We plot the first two PCA components for the embeddings corresponding to (a) all tokens, colored by token type and (b) the ten quantile tokens. In (a), we note that tokens corresponding to similar categories tend to be grouped within the embedding. For (b), we note that the model successfully learned the relative ordering of the deciles.
  • Figure 4: For 100 timelines corresponding to inpatient mortality and for 100 timelines that do not, we plot the mean and a 95% quantile range for inpatient mortality predictions for the first $i$ tokens. On the left, we have results for MIMIC and on the right are results for UCMC. Subfigures (a) and (b) correspond to predictions from supervised fine-tuning (SFT) on the MIMIC training sequences on (a) the MIMIC test set and (b) the UCMC test set. Subfigures (c) and (d) correspond to predictions from supervised fine-tuning (SFT) on MIMIC training sequences with uniform random truncation (URT). Subfigures (e) and (f) correspond to predictions from logistic regression classifiers trained on representations from the original model (no fine-tuning) extracted from sequences with uniform random truncation.
  • Figure 5: Our code is organized logically as shown above. Running the provided slurm scripts in this order (with access to compute nodes containing 8 Nvidia A100 GPUs with two 16-core 3.0-GHz AMD Milan processors) with the provided requirements file on MIMIC data and the UCMC dataset converted to the CLIF format produces the results contained in this manuscript.
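
The category-value tokenization described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the start/end token ids (100 and 101), the toy temperature values, and the choice of `statistics.quantiles` for learning decile cut points are all assumptions made for illustration. Only the category label 33 and the idea of a category token followed by a decile token in 0..9 come from the caption.

```python
from bisect import bisect_right
from statistics import quantiles

N_DECILES = 10  # decile tokens are assumed to be the integers 0..9


def learn_decile_edges(train_values):
    """Return the 9 cut points that split the training values into deciles."""
    return quantiles(train_values, n=N_DECILES, method="inclusive")


def decile_token(value, edges):
    """Map a measurement to a decile index in 0..9 using the learned edges."""
    return bisect_right(edges, value)


def tokenize_measurement(category_token, value, edges):
    """Emit the (category, decile) token pair for one measurement."""
    return [category_token, decile_token(value, edges)]


def build_timeline(events, edges_by_category, start=100, end=101):
    """Flatten (time, category_token, value) events into a token sequence.

    The start/end ids are hypothetical placeholders; patient, admission,
    and discharge tokens from Figure 2(b) are omitted for brevity.
    """
    tokens = [start]
    for _, category, value in sorted(events):  # chronological order
        tokens += tokenize_measurement(category, value, edges_by_category[category])
    tokens.append(end)
    return tokens


# Toy "training set" of temperatures in Celsius (35.0 .. 39.0), invented here.
train_temps = [35.0 + 0.1 * i for i in range(41)]
temp_edges = learn_decile_edges(train_temps)
TEMP_TOKEN = 33  # category label from the Figure 2 example

low = tokenize_measurement(TEMP_TOKEN, 34.2, temp_edges)   # below every edge
high = tokenize_measurement(TEMP_TOKEN, 40.5, temp_edges)  # above every edge
timeline = build_timeline(
    [(2.0, TEMP_TOKEN, 40.5), (1.0, TEMP_TOKEN, 34.2)],
    {TEMP_TOKEN: temp_edges},
)
```

In this sketch the edges are learned once from the training data and reused unchanged at inference time, which mirrors the caption's statement that deciles are determined from the MIMIC training set. A measurement from another site that falls outside the training range simply lands in the lowest or highest decile.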