Tokenization Tradeoffs in Structured EHR Foundation Models

Lin Lawrence Guo, Santiago Eduardo Arciniegas, Joseph Jihyung Lee, Adam Paul Yan, George Tomlinson, Jason Fries, Lillian Sung

Abstract

Foundation models for structured electronic health records (EHRs) are pretrained on longitudinal sequences of timestamped clinical events to learn adaptable patient representations. Tokenization -- how these timelines are converted into discrete model inputs -- determines what information is preserved, how efficiently it is encoded, and which relationships must be learned versus precomputed. Yet the impact of tokenization design choices on downstream performance and computational efficiency remains largely unexplored. Here, we pretrained a transformer on pediatric EHR data under a factorial design, varying tokenization along three axes: event encoding, time encoding, and workflow annotation. We evaluated the area under the receiver operating characteristic curve (AUROC) across 74 clinical prediction tasks. Joint event encoding and positional time encoding outperformed their alternatives (on 73/74 and 71/74 tasks, respectively) while requiring 39.5% and 9.6% fewer pretraining floating-point operations, respectively. Targeted ablations traced the joint encoding advantage to local binding efficiency: code-attribute pairs are combined into single tokens rather than split across tokens that the model must learn to associate during pretraining. External evaluation on an adult intensive care unit cohort demonstrated that this advantage generalizes despite substantial vocabulary mismatch, while temporal and workflow effects remain institution-specific. These results establish tokenization as a tractable lever for improving both the performance and efficiency of EHR foundation models.
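
To make the distinction between joint and factorized event encoding concrete, the following is a minimal sketch, not the authors' implementation. The `Event` class, the example codes (`LAB/glucose`, `DX/asthma`), the attribute labels (`Q3`, `abnormal`), and the `|` separator are all hypothetical choices made for illustration; the paper's actual vocabulary and attribute binning are not given in this section.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """A single clinical event: a code plus zero or more attributes.

    Codes and attribute labels here are hypothetical placeholders.
    """
    code: str
    attributes: list[str] = field(default_factory=list)


def joint_tokens(events: list[Event]) -> list[str]:
    """Joint encoding: one token per event, code and attributes fused."""
    return ["|".join([e.code, *e.attributes]) for e in events]


def factorized_tokens(events: list[Event]) -> list[str]:
    """Factorized encoding: one token for the code, then one per attribute."""
    tokens: list[str] = []
    for e in events:
        tokens.append(e.code)
        tokens.extend(e.attributes)
    return tokens


# Hypothetical three-event timeline.
events = [
    Event("LAB/glucose", ["Q3"]),
    Event("LAB/sodium", ["Q1", "abnormal"]),
    Event("DX/asthma"),
]

joint = joint_tokens(events)
factorized = factorized_tokens(events)

print(len(joint), joint)
print(len(factorized), factorized)
```

In this toy timeline the joint tokenizer emits one token per event, while the factorized tokenizer emits one token per code and per attribute, so the model must learn which attribute tokens belong to which code. The sketch only illustrates the token-level difference; it makes no claim about the vocabulary sizes or compute costs reported in the paper.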

Paper Structure (26 sections, 2 equations, 7 figures, 15 tables)

Figures (7)

  • Figure 1: Tokenization and experiment design. (A) Tokenization design choices. Event encoding determines how clinical events are represented: joint encoding creates a single token combining the event code and its attributes, while factorized encoding uses separate tokens for the code and each attribute. Time encoding determines how temporal information is captured: Time-Positions uses rotary positional embeddings (RoPE) on patient age-in-days, while Time-Tokens inserts discrete interval tokens between events with sequential integer positions. Workflow stage determines whether clinical workflow is included: with workflow, a single lab test generates separate events at order, collection, and result times; without workflow, only the result event is retained. (B) Experimental design. (1) Source data from SickKids including all patient events and workflow stages, (2) tokenization configurations based on the factorial design, (3) next-token-prediction pretraining of one transformer model per tokenization configuration, (4) local evaluation settings using 74 clinical prediction tasks, and (5) external evaluation using MIMIC across 13 tasks, with models trained from scratch on MIMIC serving as an upper-bound reference. Abbreviations: LOINC -- Logical Observation Identifiers Names and Codes; RoPE -- rotary positional embeddings; ORD -- order; COL -- collection; EHR -- electronic health records; MEDS -- medical event data standard.
  • Figure 2: Effect of tokenization design choices on task performance and pretraining cost. (A) Task-specific differences in AUROC between paired tokenization strategies across 74 clinical prediction tasks evaluated on the SickKids dataset. Each transparent point represents the AUROC difference for a single task under a specific experimental configuration. Opaque points denote the mean AUROC difference for each task, averaged across all other experimental factors. Background bands indicate task family. Absolute AUROCs by tokenization condition and task are reported in Supplementary Tables S8 and S9, respectively. (B) Relative difference in pretraining compute, measured as FLOPs between paired tokenization strategies. Bars indicate the mean percentage reduction across configurations, with error bars showing the range observed across experimental settings. Abbreviations: AUROC -- area under the receiver operating characteristic curve; FLOPs -- floating-point operations.
  • Figure 3: Event encoding ablation. Mean AUROC for joint (blue) and factorized (red) event encoding as a function of information content (Code Only, +Attributes, +Workflow, Full), under fixed-length (solid line) and fixed-event (dashed line) sequence regimes. Time-Positions was used for time encoding. Error bars are omitted because they do not reflect variability in between-condition contrasts. Abbreviation: AUROC -- area under the receiver operating characteristic curve.
  • Figure 4: Time encoding ablation. (A) Task-specific AUROC differences relative to Order-Only for three explicit time encoding strategies. Joint event encoding was used for each condition. Each transparent point represents the AUROC difference for an individual evaluation. Opaque points denote the mean AUROC difference. Diamond color indicates time encoding strategy. Background bands indicate task family. (B) Mean AUROC by time encoding strategy. Error bars are omitted in the bar graph because they do not reflect variability in between-encoding contrasts. Abbreviation: AUROC -- area under the receiver operating characteristic curve.
  • Figure 5: External evaluation of SickKids foundation model on MIMIC. (A) Out-of-vocabulary (OOV) rates when applying tokenizers learned during SickKids pretraining to the full MIMIC dataset, stratified by event type. (B) AUROC differences between tokenization strategies for frozen SickKids-pretrained models evaluated on 13 MIMIC clinical prediction tasks. Each transparent point represents the AUROC difference for a single task under a specific experimental configuration. Opaque points denote the mean AUROC difference for each task, averaged across all other experimental factors. Background bands indicate task family. Absolute AUROCs by tokenization condition and task are reported in Supplementary Tables S14 and S15, respectively.
  • ...and 2 more figures
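
The two time encoding strategies described in the Figure 1 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the interval bins and their token names (`[GAP<1w]`, `[GAP<1m]`, `[GAP>=1m]`), the choice to emit no interval token for same-day events, and the example timeline are all hypothetical. The sketch produces only the token and position sequences; applying RoPE to the positions is left to the model and is not shown.

```python
from typing import Optional

# Each event is (patient age in days, token string). Tokens are hypothetical.
Timeline = list[tuple[int, str]]


def time_positions(timeline: Timeline) -> tuple[list[str], list[int]]:
    """Time-Positions: no extra tokens; each position is the age in days."""
    tokens = [tok for _, tok in timeline]
    positions = [age for age, _ in timeline]
    return tokens, positions


def interval_token(gap_days: int) -> Optional[str]:
    """Map a gap to a discrete interval token (hypothetical bins)."""
    if gap_days == 0:
        return None  # assumption: same-day events get no interval token
    if gap_days < 7:
        return "[GAP<1w]"
    if gap_days < 30:
        return "[GAP<1m]"
    return "[GAP>=1m]"


def time_tokens(timeline: Timeline) -> tuple[list[str], list[int]]:
    """Time-Tokens: interval tokens between events, sequential positions."""
    tokens: list[str] = []
    prev_age: Optional[int] = None
    for age, tok in timeline:
        if prev_age is not None:
            gap = interval_token(age - prev_age)
            if gap is not None:
                tokens.append(gap)
        tokens.append(tok)
        prev_age = age
    positions = list(range(len(tokens)))
    return tokens, positions


# Hypothetical timeline, sorted by age in days.
timeline = [
    (100, "DX/asthma"),
    (100, "LAB/glucose|Q3"),
    (103, "RX/salbutamol"),
    (150, "LAB/sodium|Q1"),
]

pos_tokens, pos_positions = time_positions(timeline)
tok_tokens, tok_positions = time_tokens(timeline)

print(pos_tokens, pos_positions)
print(tok_tokens, tok_positions)
```

In this toy example, Time-Positions keeps the sequence at four tokens and carries elapsed time in the position values, whereas Time-Tokens lengthens the sequence with interval tokens and uses plain integer positions.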