Time-to-Event Pretraining for 3D Medical Imaging

Zepeng Huo, Jason Alan Fries, Alejandro Lozano, Jeya Maria Jose Valanarasu, Ethan Steinberg, Louis Blankemeier, Akshay S. Chaudhari, Curtis Langlotz, Nigam H. Shah

TL;DR

This work tackles the missing temporal context in 3D medical imaging pretraining by introducing time-to-event pretraining, which leverages large-scale longitudinal EHR data to generate thousands of prognostic tasks. Using 18,945 chest CTs linked to 225M clinical events, the authors train a 3D encoder with a time-to-event objective (8,192 tasks) and adapt it with a CoxPH head for prognosis and a classifier head for diagnosis. The approach yields substantial gains in prognostic metrics (average AUROC up about 0.13 and Harrell's C-index up about 0.16) and improved calibration, while preserving diagnostic performance on external tasks. The results demonstrate the value of incorporating longitudinal outcome data into 3D imaging pretraining, enabling better clinical risk prediction and paving the way for multi-modal, prognosis-oriented foundation models.
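
The summary above mentions adapting the frozen encoder with a CoxPH head. As a rough illustration of what such a head optimizes, the following is a minimal sketch of the textbook Cox negative log partial likelihood. The function name, the Breslow-style handling of ties, and the plain-list inputs are assumptions made for illustration; the source does not specify the exact loss variant used.

```python
import math


def cox_neg_log_partial_likelihood(times, events, scores):
    """Textbook Cox negative log partial likelihood (illustrative sketch).

    times:  observed time per patient (event time or censoring time)
    events: 1 if the event was observed, 0 if right-censored
    scores: model output per patient (log relative hazard)

    Assumption: ties are handled Breslow-style, i.e. every patient with
    time >= t_i is counted in the risk set of an event at t_i.
    """
    nll = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            # Censored patients contribute only through other patients' risk sets.
            continue
        # Risk set: everyone still under observation at time t_i.
        denominator = sum(
            math.exp(s) for t, s in zip(times, scores) if t >= t_i
        )
        nll -= scores[i] - math.log(denominator)
    return nll


# Hypothetical toy data: three patients, the last one right-censored.
toy_times = [1.0, 2.0, 3.0]
toy_events = [1, 1, 0]
uninformative = cox_neg_log_partial_likelihood(toy_times, toy_events, [0.0, 0.0, 0.0])
well_ranked = cox_neg_log_partial_likelihood(toy_times, toy_events, [2.0, 1.0, 0.0])
```

In this toy setup, scores that rank earlier events as higher risk give a lower loss than constant scores, which is the behavior a prognosis head is trained toward.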

Abstract

With the rise of medical foundation models and the growing availability of imaging data, scalable pretraining techniques offer a promising way to identify imaging biomarkers predictive of future disease risk. While current self-supervised methods for 3D medical imaging models capture local structural features like organ morphology, they fail to link pixel biomarkers with long-term health outcomes due to a missing context problem. Current approaches lack the temporal context necessary to identify biomarkers correlated with disease progression, as they rely on supervision derived only from images and concurrent text descriptions. To address this, we introduce time-to-event pretraining, a pretraining framework for 3D medical imaging models that leverages large-scale temporal supervision from paired, longitudinal electronic health records (EHRs). Using a dataset of 18,945 CT scans (4.2 million 2D images) and time-to-event distributions across thousands of EHR-derived tasks, our method improves outcome prediction, achieving an average AUROC increase of 23.7% and a 29.4% gain in Harrell's C-index across 8 benchmark tasks. Importantly, these gains are achieved without sacrificing diagnostic classification performance. This study lays the foundation for integrating longitudinal EHR and 3D imaging data to advance clinical risk prediction.
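
For readers unfamiliar with Harrell's C-index, which the abstract reports alongside AUROC, here is a minimal sketch of how it can be computed on right-censored data. The function name and toy values are illustrative assumptions, and pairs with tied observed times are simply skipped, which is a simplification of what library implementations do.

```python
from itertools import combinations


def harrell_c_index(times, events, risks):
    """Harrell's concordance index for right-censored data (simplified sketch).

    times:  observed time per patient (event time or censoring time)
    events: 1 if the event was observed, 0 if right-censored
    risks:  predicted risk score (higher means the event is expected sooner)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed event.
        # Assumption: pairs with tied times are skipped for simplicity.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four patients, the second one right-censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 0, 1, 1]
c_index = harrell_c_index(toy_times, toy_events, [0.9, 0.2, 0.1, 0.7])
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering of patients by risk. Censored patients still contribute whenever they outlive a patient with an observed event.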

Paper Structure

This paper contains 43 sections, 6 equations, 20 figures, 27 tables.

Figures (20)

  • Figure 1: The missing context problem in medical imaging. Existing supervision sources (red boxes) are localized to the image itself (i.e., pixel features and descriptions of those features via text) or to the immediate clinical context via diagnosis codes. This misses future information on disease progression (black boxes), which reduces the ability to learn the correlations necessary for identifying prognostic pixel biomarkers. Time-to-event pretraining provides a principled framework for incorporating the vast amount of temporal supervision available in EHR data, both to estimate future risk in the presence of right censoring and to leverage a large, diverse set of clinical tasks, beyond just diagnoses, for pretraining.
  • Figure 2: Overview of the proposed time-to-event pretraining pipeline. Patients' longitudinal EHR timelines are transformed into large-scale, time-to-event (TTE) pretraining tasks. These tasks, which reflect informative temporal patterns for medical outcome prediction, are then used for continued pretraining (full fine-tuning) of a 3D vision encoder. The resulting encoder is then frozen and adapted to downstream tasks via different task heads for classification or TTE estimation.
  • Figure 3: Label density CDF by pretraining approach.
  • Figure 4: Overview of Label Definitions: Diagnostic tasks use labels derived from the same hospital visit as the CT scan. Prognostic tasks involve future medical events from patients' EHR timelines and are categorized into binary prognostic labels and time-to-event (TTE) prognostic labels. Note that for TTE tasks, only the time until the first occurrence is labeled.
  • Figure 5: Overview of author contributions. * denotes equal contribution.
  • ...and 15 more figures
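
The label definitions described in the Figure 4 caption can be sketched in code. The following is a minimal illustration of deriving a time-to-event label (time until the first occurrence, or right-censored at last follow-up) and a fixed-horizon binary prognostic label from a patient timeline. The `Event` class, the function names, the event codes, and the horizon parameter are all hypothetical; the source does not specify the actual labeling code.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Event:
    code: str  # hypothetical clinical event code
    day: date  # date the event was recorded


def tte_label(events: List[Event], code: str, scan_day: date,
              last_followup: date) -> Tuple[int, bool]:
    """Return (duration_days, observed) for one task code.

    Only the first occurrence after the scan is labeled. If the event never
    occurs, the patient is right-censored at the last follow-up date.
    Assumption: events on or before the scan day are not counted.
    """
    future = [e.day for e in events if e.code == code and e.day > scan_day]
    if future:
        return (min(future) - scan_day).days, True
    return (last_followup - scan_day).days, False


def binary_prognostic_label(events: List[Event], code: str, scan_day: date,
                            last_followup: date,
                            horizon_days: int) -> Optional[bool]:
    """Binary label: did the event occur within `horizon_days` of the scan?

    Returns None when the patient is censored before the horizon, since the
    outcome is then unknown.
    """
    duration, observed = tte_label(events, code, scan_day, last_followup)
    if observed and duration <= horizon_days:
        return True
    if duration >= horizon_days:
        return False  # followed through the whole horizon without the event
    return None


# Hypothetical timeline for one patient with a CT scan on 2020-01-01.
timeline = [
    Event("PE", date(2020, 3, 1)),
    Event("PE", date(2020, 6, 1)),
    Event("MI", date(2019, 5, 1)),  # before the scan, so not a future event
]
scan = date(2020, 1, 1)
end = date(2021, 1, 1)
pe_label = tte_label(timeline, "PE", scan, end)
mi_label = tte_label(timeline, "MI", scan, end)
```

The `None` case shows what a fixed-horizon binary label loses: patients censored before the horizon have to be dropped, whereas the time-to-event label still records how long they were followed without the event.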