Generative Medical Event Models Improve with Scale

Shane Waxler, Paul Blazek, Davis White, Daniel Sneider, Kevin Chung, Mani Nagarathnam, Patrick Williams, Hank Voeller, Karen Wong, Matthew Swanhorst, Sheng Zhang, Naoto Usuyama, Cliff Wong, Tristan Naumann, Hoifung Poon, Andrew Loza, Daniella Meeker, Seth Hain, Rahul Shah

TL;DR

Curiosity is a family of large-scale generative medical event models trained on Epic Cosmos to simulate patient health timelines. Using decoder-only transformers with up to 1B parameters, Curiosity demonstrates zero-shot predictive power across diverse tasks, including disease risk, differential diagnosis, and healthcare utilization, and its training follows power-law scaling relationships. The study establishes compute-optimal scaling relationships and shows that downstream clinical evaluations improve predictably with lower training loss and with more inference-time compute. This framework offers a generalizable, data-efficient path to real-world evidence generation and clinical decision support at scale, with broad implications for patient care and healthcare operations.

Abstract

Realizing personalized medicine at scale calls for methods that distill insights from longitudinal patient journeys, which can be viewed as a sequence of medical events. Foundation models pretrained on large-scale medical event data represent a promising direction for scaling real-world evidence generation and generalizing to diverse downstream tasks. Using Epic Cosmos, a dataset with medical events from de-identified longitudinal health records for 16.3 billion encounters over 300 million unique patient records from 310 health systems, we introduce the Curiosity models, a family of decoder-only transformer models pretrained on 118 million patients representing 115 billion discrete medical events (151 billion tokens). We present the largest scaling-law study of medical event data, establishing a methodology for pretraining and revealing power-law scaling relationships for compute, tokens, and model size. Consequently, we pretrained a series of compute-optimal models with up to 1 billion parameters. Conditioned on a patient's real-world history, Curiosity autoregressively predicts the next medical event to simulate patient health timelines. We studied 78 real-world tasks, including diagnosis prediction, disease prognosis, and healthcare operations. Remarkably for a foundation model with generic pretraining and simulation-based inference, Curiosity generally outperformed or matched task-specific supervised models on these tasks, without requiring task-specific fine-tuning or few-shot examples. Curiosity's predictive power consistently improves as the model and pretraining scale. Our results show that Curiosity, a generative medical event foundation model, can effectively capture complex clinical dynamics, providing an extensible and generalizable framework to support clinical decision-making, streamline healthcare operations, and improve patient outcomes.
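The simulation-based inference described in the abstract can be illustrated with a minimal sketch: condition on a patient's history, sample several future trajectories one event at a time, and report the fraction of trajectories in which a target event appears. This is not the authors' implementation. The sampler interface, the `<END>` token, the toy vocabulary, and all function names are hypothetical stand-ins for a trained model.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical interface: given the events seen so far, return one sampled next event.
NextEventSampler = Callable[[Sequence[str]], str]


def simulate_trajectory(
    history: Sequence[str],
    sample_next: NextEventSampler,
    max_events: int,
    end_token: str = "<END>",  # assumed stop token, not from the paper
) -> List[str]:
    """Autoregressively generate up to max_events future events."""
    context = list(history)
    trajectory: List[str] = []
    for _ in range(max_events):
        event = sample_next(context)
        if event == end_token:
            break
        trajectory.append(event)
        context.append(event)  # each generated event conditions the next one
    return trajectory


def estimate_event_probability(
    history: Sequence[str],
    target: str,
    sample_next: NextEventSampler,
    n_simulations: int = 100,
    max_events: int = 50,
) -> float:
    """Fraction of simulated trajectories that contain the target event."""
    hits = 0
    for _ in range(n_simulations):
        if target in simulate_trajectory(history, sample_next, max_events):
            hits += 1
    return hits / n_simulations


def make_toy_sampler(seed: int) -> NextEventSampler:
    """Toy sampler with fixed event weights; a real model would use the context."""
    rng = random.Random(seed)
    vocab = ["office_visit", "lab_a1c", "dx_t2dm", "<END>"]
    weights = [0.5, 0.3, 0.1, 0.1]

    def sample_next(context: Sequence[str]) -> str:
        return rng.choices(vocab, weights=weights, k=1)[0]

    return sample_next


if __name__ == "__main__":
    risk = estimate_event_probability(
        ["office_visit", "lab_a1c"], "dx_t2dm", make_toy_sampler(0), n_simulations=200
    )
    print(f"Estimated probability of dx_t2dm: {risk:.2f}")
```

In this sketch, raising `n_simulations` spends more inference-time compute to reduce the sampling noise of the estimate, which is one plausible reading of why inference-time compute affects downstream evaluations.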

Paper Structure

This paper contains 71 sections, 4 equations, 25 figures, 24 tables.

Figures (25)

  • Figure 1: Overview of Curiosity pretraining and inference. A patient journey is formulated as a sequence of medical events, and Curiosity learns by predicting the next medical event. At inference time, Curiosity is prompted with a patient's medical event history and simulates potential future trajectories by autoregressively generating the next events. Predictions for any target in Curiosity's vocabulary are obtained from these simulated trajectories, enabling broad, out-of-the-box use on downstream tasks without task-specific fine-tuning or few-shot prompts.
  • Figure 2: Overview of Curiosity evaluation performance. Each point shows the change in median evaluation scores for Curiosity-S, Curiosity-M, and Curiosity-L relative to the best-performing task-specific supervised model in each of the major evaluation categories. For AUCROC and PR-AUC, positive values indicate that Curiosity outperforms the task-specific model and negative values indicate underperformance; the opposite is true for MAE, where lower is better. Curiosity's performance improved with scale and generally matched or even outperformed the best task-specific supervised methods.
  • Figure 3: Calibration plots for encounter frequency. Curiosity-L predicted the probability of how many encounters each patient will have within the next year, for three encounter types (Office Visit, Emergency, and Inpatient). Each point represents a quantile group containing an equal number of patients with similar predicted probabilities. The horizontal position of each point reflects the group's average predicted probability and the vertical position reflects the fraction of patients in that group with the specified 1-year count of encounters. Some lines do not span the full horizontal axis because few patients had those predicted probabilities. The diagonal line indicates perfect probability calibration.
  • Figure 4: Medical events predicted for single encounters. For each of three encounter types (office visits, emergency visits, and inpatient admissions), 10,000 random encounters were selected, and their medical events were compared to the medical events that Curiosity predicted over 20 generations. The micro-averaged precision and recall are plotted over various thresholds for the diagnosis, lab, medication, and procedure medical event types. To provide context for Curiosity's performance, we pooled the patient's past events over various lookback windows and plotted the precision and recall for each. Higher area under each curve indicates better performance.
  • Figure 5: T2DM-specific outcome predictions. Percent increase of AUCROC from the best-performing task-specific supervised model for each of the three Curiosity models on the T2DM-specific outcome prediction tasks.
  • ...and 20 more figures
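
The quantile grouping described in the Figure 3 caption can be sketched as follows: sort patients by predicted probability, split them into groups of equal size, and plot each group's mean predicted probability against its observed outcome fraction. This is a minimal illustration under assumed names (`calibration_points`, `n_groups`), not the code used to produce the figure.

```python
from typing import List, Sequence, Tuple


def calibration_points(
    predicted: Sequence[float],
    observed: Sequence[int],
    n_groups: int = 10,
) -> List[Tuple[float, float]]:
    """Return (mean predicted probability, observed fraction) per quantile group.

    predicted: model probabilities for one outcome (e.g. "exactly 2 office visits").
    observed:  1 if the patient actually had that outcome, else 0.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if len(predicted) == 0:
        return []

    # Sort patients by predicted probability so that each group holds
    # patients with similar predictions.
    pairs = sorted(zip(predicted, observed), key=lambda pair: pair[0])
    n = len(pairs)
    n_groups = min(n_groups, n)

    points: List[Tuple[float, float]] = []
    for g in range(n_groups):
        # Integer arithmetic keeps group sizes as equal as possible.
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        mean_pred = sum(p for p, _ in group) / len(group)
        frac_observed = sum(o for _, o in group) / len(group)
        points.append((mean_pred, frac_observed))
    return points


if __name__ == "__main__":
    preds = [0.1, 0.1, 0.9, 0.9]
    outcomes = [0, 0, 1, 1]
    for mean_pred, frac in calibration_points(preds, outcomes, n_groups=2):
        print(f"predicted={mean_pred:.2f} observed={frac:.2f}")
```

Points that fall on the diagonal (mean predicted probability equal to observed fraction) correspond to well-calibrated groups.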