
Scaling Recurrence-aware Foundation Models for Clinical Records via Next-Visit Prediction

Haresh Rengaraj Rajamohan, Xiang Gao, Weicheng Zhu, Shih-Lun Huang, Long Chen, Gabe Schulman, Huizhen Jin, Shengduo Li, Yixuan Wang, Huidi Yang, Kyunghyun Cho, Cem M. Deniz, Narges Razavian

Abstract

While large-scale pretraining has revolutionized language modeling, its potential remains underexplored in healthcare with structured electronic health records (EHRs). We present RAVEN, a novel generative pretraining strategy for sequential EHR data based on Recurrence-Aware next-Visit EveNt prediction. Leveraging a dataset of over one million unique individuals, our model learns to autoregressively generate tokenized clinical events for the next visit conditioned on patient history. We introduce regularization on predicting repeated events and highlight a key pitfall in EHR-based foundation model evaluations: repeated event tokens can inflate performance metrics when new onsets are not distinguished from subsequent occurrences. Furthermore, we empirically investigate the scaling behaviors in a data-constrained, compute-saturated regime, showing that simply increasing model size is suboptimal without commensurate increases in data volume. We evaluate our model via zero-shot prediction for forecasting the incidence of a diverse set of diseases, where it rivals fully fine-tuned representation-based Transformer models and outperforms widely used simulation-based next-token approaches. Finally, without additional parameter updates, we show that RAVEN can generalize to an external patient cohort under lossy clinical code mappings and feature coverage gaps.
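The recurrence-aware regularization the abstract describes — downweighting repeated target events so that first onsets retain training signal — can be sketched in a few lines. This is a minimal illustration, assuming a simple multi-label binary cross-entropy objective and a power-law weight of lambda ** prior_count on repeated positives; the paper's actual objective, weighting rule, and normalization may differ.

```python
import math

def recurrence_weighted_loss(probs, targets, prior_counts, lam=0.5):
    """Weighted binary cross-entropy over next-visit event tokens.

    probs        : predicted probability for each candidate event
    targets      : 0/1 label for whether the event occurs at the next visit
    prior_counts : how often the event already appeared in the history
    lam          : decay parameter; lam = 1.0 recovers the unweighted loss

    Positive targets that already occurred are downweighted by
    lam ** prior_count (an assumed form), so a first onset (count 0)
    keeps full weight while chronic repeats are damped.
    """
    total, norm = 0.0, 0.0
    for p, y, c in zip(probs, targets, prior_counts):
        w = lam ** c if y == 1 else 1.0  # downweight repeated positives only
        bce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += w * bce
        norm += w
    return total / norm  # weight-normalized average loss
```

With lam below 1.0, a patient's long-running chronic codes contribute less to the gradient than newly appearing ones, which is the behavior Figure 1b attributes to the pretraining objective.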


Paper Structure

This paper contains 27 sections, 7 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Study overview of recurrence-aware next-visit foundation modeling on longitudinal EHRs. a, Patient trajectories are represented as temporally ordered visits containing unordered clinical events. During pretraining, a time-coded separator token [SEP] predicts the full event set at the next visit; during zero-shot inference, the same interface queries disease risk at future horizons by appending a separator token at time $t+H$ and pooling logits over condition-specific code sets. b, Repeated chronic targets are downweighted during pretraining according to their prior count so that rare first onsets retain training signal. c, In a finite EHR corpus revisited over many epochs, model size must be matched to the available data, as larger models eventually overfit; the selected 144M model lies near the full-data optimum. d, Model development uses a patient-level 70%/15%/15% split of patient data from NYU Langone; a schematic external zero-shot transfer setting based on Stanford EHRSHOT highlights ontology harmonization between benchmark concepts and the institutional token space, which can introduce information loss and drop certain features.
  • Figure 2: Zero-shot generalization of RAVEN for new diagnosis predictions on EHRSHOT. All baseline models (CLMBR-T, GBM, Logistic Regression, Random Forest) are trained on varying numbers of labeled examples per class ($K$) drawn from the training dataset, whereas RAVEN is evaluated in a zero-shot setting with no target-domain supervision. a, AUROC as a function of the number of training examples per class. Solid lines denote baseline models with s.d. shading across random seeds; dashed horizontal lines indicate RAVEN performance at $K=0$ for three regularization strengths ($\lambda = 1.0, 0.5, 0.25$), with 95% confidence intervals shown at the zero-shot axis. RAVEN with different regularization strengths matches or exceeds baselines trained on varying numbers of labeled examples for conditions including acute MI, hyperlipidaemia, and hypertension. b, Comparison to baselines at $K = \text{all}$ (full EHRSHOT training set). RAVEN is competitive on certain conditions with fully fine-tuned models despite having seen zero training examples. Error bars denote 95% confidence intervals.
  • Figure 3: Effect of history-dependent regularization strength. We sweep the decay parameter $\lambda$ that downweights repeated target events during training and report OnTime and AUPRC across conditions. All main downstream results use a single global setting, the intermediate value $\lambda^\star = 0.5$.
  • Figure 4: Effect of recurrence-aware regularization on zero-shot disease onset prediction. a, Change in AUROC from no regularization ($\lambda$ = 1.0) across seven conditions at 2-year and 5-year horizons. Bars above zero indicate improved discrimination under regularization. b, Corresponding change in F1 score. Dashed vertical lines separate the 2-year and 5-year evaluation groups. c, Per-condition AUROC as a function of the decay parameter $\lambda$ for the 2-year and 5-year horizons, with shaded 95% confidence intervals. The macro-average panel (bottom right) summarizes the overall trend. All results use the 144M model trained on the full dataset; lower $\lambda$ corresponds to stronger penalization of repeated clinical events.
  • Figure 5: Compute-saturated scaling of standard and RAVEN pretraining. (a) Validation loss and test loss under the standard next-visit objective as a function of model size across multiple dataset budgets. The fitted minima shift toward larger models as the data budget increases, indicating that the optimal capacity depends strongly on the amount of available training data. (b) The same scaling analysis with full RAVEN history-dependent regularization enabled during pretraining, showing how the trend transfers to training with recurrence-aware weighting.
  • ...and 1 more figure
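Figure 1a describes zero-shot risk queries as appending a separator token at a future time and pooling logits over a condition-specific code set. A minimal sketch of one plausible pooling rule is shown below, using noisy-OR over per-code next-visit probabilities; the paper's exact pooling function (sum, max, or noisy-OR) is not specified here, so this choice is an assumption, and the code identifiers are illustrative.

```python
def zero_shot_risk(token_probs, code_set):
    """Pool next-visit event probabilities over a condition's code set.

    token_probs : dict mapping clinical code -> predicted probability
                  that the code appears at the queried future visit
    code_set    : collection of codes defining the condition

    Noisy-OR pooling: the risk that at least one code in the set is
    generated, assuming (illustratively) independence across codes.
    Codes absent from the model's vocabulary contribute zero risk,
    mirroring the feature coverage gaps noted for external transfer.
    """
    none_fire = 1.0
    for code in code_set:
        none_fire *= 1.0 - token_probs.get(code, 0.0)
    return 1.0 - none_fire
```

For example, if two hypertension-related codes each carry modest probability, the pooled risk exceeds either one alone, which is the intended behavior when a condition maps to several institutional tokens.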