Large Language Models are Powerful Electronic Health Record Encoders

Stefan Hegselmann, Georg von Arnim, Tillmann Rheude, Noel Kronenberg, David Sontag, Gerhard Hindricks, Roland Eils, Benjamin Wild

TL;DR

This work investigates repurposing general-purpose Large Language Models (LLMs) to encode Electronic Health Records (EHRs) by serializing structured records into text, enabling high-dimensional embeddings that drive clinical predictions without private, institution-specific training data. The approach is evaluated on the EHRSHOT benchmark and an external UK Biobank cohort, showing that LLM embeddings can match or surpass a domain-specific EHR foundation model (CLMBR-T-Base) across multiple tasks, especially under domain shifts and low-data regimes. Key findings include the robustness of Markdown-style EHR serialization across formats, the benefit of focusing on recent history with extended context windows, and the complementary potential of combining LLM embeddings with domain-specific representations. The results argue for scalable, interoperable EHR encoders that leverage broad text pretraining, enabling cross-institution applicability and flexible integration with standard discriminative heads, while highlighting tradeoffs in computation, calibration, and the need for broader external validation.

Abstract

Electronic Health Records (EHRs) offer considerable potential for clinical prediction, but their complexity and heterogeneity present significant challenges for traditional machine learning methods. Recently, domain-specific EHR foundation models trained on large volumes of unlabeled EHR data have shown improved predictive accuracy and generalization. However, their development is constrained by limited access to diverse, high-quality datasets, and inconsistencies in coding standards and clinical practices. In this study, we explore the use of general-purpose Large Language Models (LLMs) to encode EHR into high-dimensional representations for downstream clinical prediction tasks. We convert structured EHR data into Markdown-formatted plain-text documents by replacing medical codes with natural language descriptions. This enables the use of LLMs and their extensive semantic understanding and generalization capabilities as effective encoders of EHRs without requiring access to private medical training data. We show that LLM-based embeddings can often match or even surpass the performance of a specialized EHR foundation model, CLMBR-T-Base, across 15 diverse clinical tasks from the EHRSHOT benchmark. Critically, our approach requires no institution-specific training and can incorporate any medical code with a text description, whereas existing EHR foundation models operate on fixed vocabularies and can only process codes seen during pretraining. To demonstrate generalizability, we further evaluate the approach on the UK Biobank (UKB) cohort, out-of-domain for CLMBR-T-Base, whose fixed vocabulary covers only 16% of UKB codes. Notably, an LLM-based model achieves superior performance for prediction of disease onset, hospitalization, and mortality, indicating robustness to population and coding shifts.
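The serialization step described in the abstract (replacing medical codes with natural language descriptions and emitting a Markdown document) can be illustrated with a minimal sketch. This is not the authors' implementation: the `CODE_DESCRIPTIONS` lookup, the `serialize_ehr` and `describe` helpers, the section headings, and the "days before prediction time" phrasing are all assumptions made for illustration. Only the two ICD-10-CM descriptions are real; the medication code is a hypothetical placeholder.

```python
from datetime import date

# Hypothetical code-to-description lookup. A real pipeline would draw on an
# ontology covering every vocabulary present in the data.
CODE_DESCRIPTIONS = {
    "ICD10CM/E11.9": "Type 2 diabetes mellitus without complications",
    "ICD10CM/I10": "Essential (primary) hypertension",
    "EXAMPLE_DRUG/0001": "Metformin 500 mg oral tablet",  # placeholder code
}


def describe(code):
    """Return the text description of a code, or the raw code if unknown."""
    return CODE_DESCRIPTIONS.get(code, code)


def serialize_ehr(demographics, events, reference_date):
    """Render demographics and events as Markdown, most recent event first.

    Dates are expressed relative to `reference_date` (an assumed convention).
    """
    lines = ["# Electronic Health Record", "", "## Demographics"]
    for key, value in demographics.items():
        lines.append(f"- {key}: {value}")
    lines += ["", "## Events"]
    for event in sorted(events, key=lambda e: e["date"], reverse=True):
        days_before = (reference_date - event["date"]).days
        lines.append(
            f"- {days_before} days before prediction time: {describe(event['code'])}"
        )
    return "\n".join(lines)


patient_events = [
    {"date": date(2023, 1, 10), "code": "ICD10CM/I10"},
    {"date": date(2023, 3, 2), "code": "ICD10CM/E11.9"},
]
document = serialize_ehr(
    {"Age": 61, "Gender": "female"}, patient_events, date(2023, 3, 12)
)
print(document)
```

Because any code with a text description can be rendered this way, the resulting document can be passed to an off-the-shelf text embedding model without a fixed code vocabulary; codes without a description fall back to the raw identifier in this sketch.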

Paper Structure

This paper contains 39 sections, 17 figures, 15 tables.

Figures (17)

  • Figure 1: Study Overview. (a) EHR foundation models are pretrained on unlabeled EHR data. Common unsupervised learning tasks are masked code prediction and next code prediction. To obtain a representation for an EHR, we use the hidden states of the pretrained model. (b) LLMs are pretrained on vast amounts of text data. To obtain an LLM embedding model, architectural changes are applied, and contrastive learning is used to improve representational performance. To obtain an EHR embedding, the data is first serialized as text and then processed by the LLM embedding model. Again, we use the hidden states for the embedding. (c) We use the EHRSHOT benchmark and the UKB cohort for our experiments. Medical events of each patient are converted into numerical embeddings using an EHR foundation model and an LLM embedding model, respectively. A logistic regression (LR) model is trained, validated, and tested for each clinical prediction task. We also test a gradient boosting machine (GBM) prediction model for the count-based baseline. Icons from flaticon.com.
  • Figure 2: Example EHR Text Serialization. The EHR data is serialized into plain text so that LLM embedding models can be applied. We use Markdown formatting and prioritize relevant medical information. All dates are normalized relative to a reference date. Next, the patient’s demographics are listed. Time-series data coded via LOINC is aggregated into 24 key concepts, each listed with its last three values, units, and a classification into low, normal, or high. Then, a list of all visits and all concepts not associated with a visit is given. Lastly, detailed visit entries are listed, beginning with the most recent. Unique concepts are categorized into conditions, medications, and procedures. The last three values of a concept are given when present.
  • Figure 3: Scaling Behavior on EHRSHOT. Number of model parameters (x-axis) and macro-averaged AUROC performance with bootstrapped 95% confidence intervals across all four task groups (y-axis). We include only model families that are available in multiple sizes. The performance results of Qwen3- and Qwen2-based LLM embedding models suggest a scaling behavior with model size. Encoder-only models based on the BERT architecture do not show this trend. The specialized EHR foundation model, CLMBR-T-Base, is the most parameter-efficient model. Full results are in the corresponding table (tab:ehrshot_performance_on_all_examples_full).
  • Figure 4: Few-Shot Performance on EHRSHOT. Mean AUROC performance across subtasks for four task groups (bold). Blurred lines are averaged AUROC values across five bootstrapped runs using different seeds (Wornow et al., 2023). Similar to the EHR foundation model, CLMBR-T-Base, the LLM embedding models show the largest performance gains over the count-based model at intermediate numbers of training examples. With an increased number of training examples, the advantage of pretrained LLM-based models decreases.
  • Figure 6: Effects of EHR Serialization Components on EHRSHOT. Mean AUROC performance and bootstrapped 95% confidence intervals for Qwen3-Emb-8B using a 4096-token limit imposed by computational constraints. The default Markdown EHR serialization (Full EHR) appears at the top, followed by runs with a generic and an empty instruction (orange). We then evaluate the serialization by removing specific components (green) and by retaining only individual components (red). Full results are reported in the corresponding table (tab:ehrshot_ehr_serialization_ablation_experiments_for_llm_embeddings_models).
  • ...and 12 more figures
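
Several captions above report macro-averaged AUROC with bootstrapped 95% confidence intervals. As a rough illustration of how such an interval can be computed, the following standard-library sketch uses a percentile bootstrap over per-task examples. The resampling unit, the number of resamples, and the function names (`auroc`, `macro_auroc`, `bootstrap_ci`) are assumptions for illustration; the source does not specify the exact procedure, and the data below is a toy example, not a result from the paper.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def macro_auroc(tasks):
    """Unweighted mean AUROC over tasks, each given as (labels, scores)."""
    return sum(auroc(labels, scores) for labels, scores in tasks) / len(tasks)


def bootstrap_ci(tasks, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the macro-averaged AUROC."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resampled = []
        for labels, scores in tasks:
            # Resample examples with replacement within each task.
            idx = [rng.randrange(len(labels)) for _ in labels]
            resampled.append(([labels[i] for i in idx], [scores[i] for i in idx]))
        try:
            estimates.append(macro_auroc(resampled))
        except ValueError:
            continue  # skip resamples in which a task lost one of its classes
    estimates.sort()
    lower = estimates[int(alpha / 2 * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


# Toy predictions for two tasks (not data from the paper).
tasks = [
    ([0, 0, 1, 1, 0, 1], [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]),
    ([1, 0, 1, 0, 0, 1], [0.7, 0.3, 0.6, 0.5, 0.2, 0.9]),
]
point = macro_auroc(tasks)
low, high = bootstrap_ci(tasks)
print(f"macro AUROC = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

The pairwise AUROC computation is quadratic in the number of examples, which is acceptable for a sketch but would typically be replaced by a rank-based routine on benchmark-sized data.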