Can we generate portable representations for clinical time series data using LLMs?

Zongliang Ji, Yifei Sun, Andre Amaral, Anna Goldenberg, Rahul G. Krishnan

Abstract

Deploying clinical ML is slow and brittle: models that work at one hospital often degrade under distribution shifts at the next. In this work, we study a simple question: can large language models (LLMs) create portable patient embeddings, i.e., representations of patients that enable a downstream predictor built at one hospital to be used elsewhere with minimal-to-no retraining and fine-tuning? To do so, we map irregular ICU time series onto concise natural language summaries using a frozen LLM, then embed each summary with a frozen text embedding model to obtain a fixed-length vector that can serve as input to a variety of downstream predictors. Across three cohorts (MIMIC-IV, HIRID, PPICU), on multiple clinically grounded forecasting and classification tasks, we find that our approach is simple, easy to use, and competitive in-distribution with grid imputation, self-supervised representation learning, and time series foundation models, while exhibiting smaller relative performance drops when transferring to new hospitals. We study how performance varies with prompt design and find that structured prompts are crucial for reducing the variance of the predictive models without altering mean accuracy. We find that these portable representations improve few-shot learning and do not increase demographic recoverability of age or sex relative to baselines, suggesting little additional privacy risk. Our work points to the potential of LLMs as tools for the scalable deployment of production-grade predictive models by reducing engineering overhead.
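
The two-stage mapping described in the abstract (irregular time series to text summary, then summary to fixed-length vector) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `render_prompt`, `summarize`, `embed`, `record_to_vector`, the prompt wording, and `EMBED_DIM` are hypothetical names and values, and the LLM and embedding model are replaced by deterministic stand-ins so that the example runs without any external service. It is not the authors' implementation.

```python
import hashlib
import math
from typing import Dict, List, Tuple

# Irregular time series: per-variable list of (hours_since_admission, value).
Record = Dict[str, List[Tuple[float, float]]]

EMBED_DIM = 64  # assumed size; a real embedding model fixes its own dimension


def render_prompt(record: Record) -> str:
    """Serialize irregular measurements into text that an LLM could summarize."""
    lines = []
    for name in sorted(record):
        obs = sorted(record[name])  # order observations by time
        series = ", ".join(f"{v:g} at {t:g}h" for t, v in obs)
        lines.append(f"{name}: {series}")
    return "Summarize this ICU stay for a clinical handoff.\n" + "\n".join(lines)


def summarize(prompt: str) -> str:
    """Hypothetical stand-in for the frozen LLM: echoes the data lines."""
    return " ".join(prompt.splitlines()[1:])


def embed(text: str, dim: int = EMBED_DIM) -> List[float]:
    """Hypothetical stand-in for the frozen text embedder.

    Uses a hashed bag of words followed by L2 normalization, so the output
    length is always `dim` regardless of the length of the input text.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def record_to_vector(record: Record) -> List[float]:
    """Time series -> prompt -> summary -> fixed-length vector."""
    return embed(summarize(render_prompt(record)))


# Two stays with different variables and numbers of observations.
stay_a = {"heart_rate": [(0.5, 88.0), (3.0, 112.0)], "lactate": [(2.0, 3.1)]}
stay_b = {"heart_rate": [(1.0, 75.0)]}
vec_a = record_to_vector(stay_a)
vec_b = record_to_vector(stay_b)
```

Both stays map to vectors of the same length even though their inputs differ in shape, which is the property that lets one downstream predictor consume records from different sites. In practice the two stand-ins would be replaced by calls to an actual frozen LLM and text embedding model.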

Paper Structure (44 sections, 2 equations, 7 figures, 45 tables)

Figures (7)

  • Figure 1: Motivation for Record2Vec. Numeric imputation loses clinical semantics and limits portability; human handoffs preserve meaning but are costly and variable. LLMs can create handoff-style summaries that retain semantics and provide portable inputs for forecasting and classification.
  • Figure 2: Methods to generate medical-record representations. Top to bottom: imputation pipeline; self-supervised TS representation (TSDE); TS foundation model (TimesFM); and Record2Vec: LLM summary followed by text embedding.
  • Figure 3: Rank distributions for No-summary vs. three LLM variants across 15 in-distribution tasks (left) and 30 cross-site transfer tasks (right). Methods are ranked based on performance across five downstream tasks: Forecast (MSE), LoS (MAE), Mortality (AUROC), Drug (Recall), and Lab (Recall). See the Appendix for the detailed values.
  • Figure 4: Rank distributions for four prompt variants with Gemini 2.0-flash across 15 in-distribution tasks (left) and 30 cross-site transfer tasks (right). Lower ranks indicate better performance. Methods are ranked based on performance across five downstream tasks: Forecast (MSE), LoS (MAE), Mortality (AUROC), Drug (Recall), and Lab (Recall). See the Appendix for the detailed values.
  • Figure 5: Few-shot finetuning with 16 labeled target samples for mortality prediction (RQ5), shown across six transfer settings. All tasks are reported with the same metrics as in the previous section: Forecast (masked MSE), LoS (MAE), and Mortality (AUROC). The first row shows HIRID $\rightarrow$ PPICU and the second row shows MIMIC $\rightarrow$ PPICU. Reference lines: blue = best in-distribution upper bound, orange = best pre-finetune result, black = best finetuned baseline. Record2Vec surpasses the baselines by a large margin, reaching performance comparable to in-distribution training.
  • ...and 2 more figures
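
The rank distributions in Figures 3 and 4 depend on each task's metric direction: MSE and MAE are minimized, while AUROC and Recall are maximized. A minimal sketch of one way to compute such ranks follows. The helper names, the `Task@setting` key format, the tie-breaking rule, and all numbers are assumptions for illustration; the values are placeholders and not results from the paper.

```python
from typing import Dict, List

# Metric direction per task, following the captions: errors are minimized,
# AUROC and recall are maximized.
HIGHER_IS_BETTER = {
    "Forecast": False,  # MSE
    "LoS": False,       # MAE
    "Mortality": True,  # AUROC
    "Drug": True,       # Recall
    "Lab": True,        # Recall
}


def rank_methods(scores: Dict[str, float], higher_is_better: bool) -> Dict[str, int]:
    """Rank methods on one task; rank 1 is best, ties are broken by name order."""
    ordered = sorted(
        scores,
        key=lambda m: (-scores[m] if higher_is_better else scores[m], m),
    )
    return {method: position + 1 for position, method in enumerate(ordered)}


def rank_distribution(results: Dict[str, Dict[str, float]]) -> Dict[str, List[int]]:
    """Collect each method's ranks over all tasks.

    Keys of `results` are assumed to look like 'Task@setting'.
    """
    ranks: Dict[str, List[int]] = {}
    for task_key, scores in results.items():
        task = task_key.split("@")[0]
        per_task = rank_methods(scores, HIGHER_IS_BETTER[task])
        for method, rank in per_task.items():
            ranks.setdefault(method, []).append(rank)
    return ranks


# Placeholder numbers for illustration only; they are not results from the paper.
toy_results = {
    "Forecast@site_1": {"method_a": 0.40, "method_b": 0.35, "method_c": 0.50},
    "Mortality@site_1": {"method_a": 0.80, "method_b": 0.85, "method_c": 0.70},
    "LoS@site_1": {"method_a": 2.0, "method_b": 2.5, "method_c": 3.0},
}
toy_ranks = rank_distribution(toy_results)
```

Each method ends up with one rank per task and setting, and the resulting lists are what a rank-distribution plot would summarize.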