Sensing Cardiac Health Across Scenarios and Devices: A Multi-Modal Foundation Model Pretrained on Heterogeneous Data from 1.7 Million Individuals

Xiao Gu, Wei Tang, Jinpei Han, Veer Sangha, Fenglin Liu, Shreyank N Gowda, Antonio H. Ribeiro, Patrick Schwab, Kim Branson, Lei Clifton, Antonio Luiz P. Ribeiro, Zhangdaihong Liu, David A. Clifton

Abstract

Cardiac biosignals, such as electrocardiograms (ECG) and photoplethysmograms (PPG), are of paramount importance for the diagnosis, prevention, and management of cardiovascular diseases, and have been extensively used in a variety of clinical tasks. Conventional deep learning approaches for analyzing these signals typically rely on homogeneous datasets and static bespoke models, limiting their robustness and generalizability across diverse clinical settings and acquisition protocols. In this study, we present a cardiac sensing foundation model (CSFM) that leverages advanced transformer architectures and a generative, masked pretraining strategy to learn unified representations from vast, heterogeneous health records. Our model is pretrained on an innovative multi-modal integration of data from multiple large-scale datasets (including MIMIC-III-WDB, MIMIC-IV-ECG, and CODE), comprising cardiac signals and the corresponding clinical or machine-generated text reports from approximately 1.7 million individuals. We demonstrate that the embeddings derived from our CSFM not only serve as effective feature extractors across diverse cardiac sensing scenarios, but also enable seamless transfer learning across varying input configurations and sensor modalities. Extensive evaluations across diagnostic tasks, demographic information recognition, vital sign measurement, clinical outcome prediction, and ECG question answering reveal that CSFM consistently outperforms traditional one-modal-one-task approaches. Notably, CSFM exhibits robust performance across multiple ECG lead configurations, from standard 12-lead systems to single-lead setups, and in scenarios where only ECG, only PPG, or a combination thereof is available. These findings highlight the potential of CSFM as a versatile and scalable solution for comprehensive cardiac monitoring.

Paper Structure

This paper contains 16 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Illustration of the cardiac sensing foundation model (CSFM), capable of taking heterogeneous biomedical signals as input and versatile for different cardiac health-related downstream tasks. Given the diversity in sensing scenarios and devices, the collected biomedical sensing data vary in both the signal types and the channels of each record. We trained CSFM on an innovative integration of multiple cardiac sensing datasets collected from around 1.7 million individuals and assessed its performance in diverse healthcare scenarios. The generalization capability of CSFM is tested across different scenarios, from hospital to home, on representative tasks. These include cardiac disease diagnosis, demographic information recognition, vital sign measurement, clinical outcome prediction (spanning short-term ICU alerts to long-term mortality), and ECG-based question answering.
  • Figure 2: Statistics of our training and testing datasets. a. Illustration of pretraining datasets. Our pretraining dataset is aggregated from heterogeneous records across multiple sources, including MIMIC-III-WDB, MIMIC-IV-ECG, and CODE-Full. Note that while MIMIC-IV and MIMIC-III-WDB may contain overlapping subjects, all records are de-identified, making subject linkage impossible, and their data segments were collected from distinct clinical scenarios. The left plot shows the number of recorded segments per dataset, while the right plot shows the number of segments per signal modality. b. Illustration of downstream tasks and datasets. Our downstream evaluation spans five cardiology-related scenarios: cardiovascular disease diagnosis (CDD), demographic information recognition (DIR), vital sign measurement (VSM), clinical outcome prediction (COP), and question answering (QA). The downstream datasets were collected from multiple sources, including CinC17 [clifford2017af], PTB-XL [wagner2020ptb], SimBand [shashikumar2017deep], VTaC [lehman2024vtac], and CODE-15 [lima2021deep, ribeiro2020automatic]. The figure on the right summarizes the distribution of signals across different modalities.
  • Figure 3: Overall performance across different healthcare scenarios, validated separately on the corresponding downstream datasets. a. Cardiovascular disease diagnosis across different datasets. Performance was measured by macro-F1 for multi-label or multi-class classification. b. Demographic information recognition. Age and BMI prediction (univariate regression) was measured by mean absolute error (MAE, lower is better), whereas gender prediction (binary classification) was measured by AUC (higher is better). c. Blood pressure waveform reconstruction with Lead-II ECG and PPG as input. We compared the errors of both the derived numeric values (i.e., SBP and DBP) and the raw waveforms. The R-squared values of the derived SBP and DBP against the ground truth were also calculated. d. 1-year mortality prediction based on 12-lead diagnostic ECGs. The receiver operating characteristic (ROC) curve is presented. e. ICU false alarm prediction based on signals (ECG & PPG) recorded right before the alarm. The receiver operating characteristic (ROC) curve is presented. f. ECG question answering with paired ECGs and questions. Question answering was formulated as a multi-choice QA system in which, for each question template, the model selects the most appropriate answers from a set of candidate responses. Performance was measured using the macro-F1 score, computed over only the valid candidate answers for each question.
  • Figure 4: Performance under different channel settings (full 12-Lead, 6-Lead, 2-Lead, Lead II) or combinations of different sensing modalities (ECG and PPG). In particular, 6-Lead utilizes {Lead I, II, III, aVL, aVR, and aVF}, and 2-Lead utilizes {Lead II and V5}. The experiments on ECG lead variations were performed for cardiovascular disease diagnosis on PTB-XL (leftmost) and mortality prediction on CODE-15 (middle). In addition, we examined the model's generalization performance across different sensing modalities for ICU false alarm prediction on VTaC (rightmost).
  • Figure 5: ECG Question Answering, for lead-related questions with only lead-II ECG as input. We selected a subset of questions that are intuitively related to leads (including the keyword "lead"), with representative examples illustrated in the accompanying plot. We compared the performance of the Fusion Transformer and CSFM when restricted to Lead II input, and also reported their performance when trained or fine-tuned on all 12 leads. Performance was measured using the macro-F1 score, calculated based on the valid candidate options for each question.
  • ...and 3 more figures
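
The lead configurations named in the Figure 4 caption can be illustrated with a minimal sketch that extracts each subset from a 12-lead record. This is not the authors' code: the channel order in `STANDARD_12_LEADS` is an assumed convention that the captions do not state, and `LEAD_CONFIGS` and `select_leads` are hypothetical names introduced only for this example.

```python
from typing import Dict, List, Sequence

# Assumed channel order of a 12-lead record (a common convention,
# not stated in the source).
STANDARD_12_LEADS: List[str] = [
    "I", "II", "III", "aVR", "aVL", "aVF",
    "V1", "V2", "V3", "V4", "V5", "V6",
]

# Lead subsets as listed in the Figure 4 caption.
LEAD_CONFIGS: Dict[str, List[str]] = {
    "12-lead": list(STANDARD_12_LEADS),
    "6-lead": ["I", "II", "III", "aVL", "aVR", "aVF"],
    "2-lead": ["II", "V5"],
    "lead-II": ["II"],
}


def select_leads(
    record: Sequence[Sequence[float]],
    config: str,
    lead_order: Sequence[str] = STANDARD_12_LEADS,
) -> List[Sequence[float]]:
    """Return the channels of `record` required by `config`.

    `record` holds one sequence of samples per lead, ordered as in
    `lead_order`. Channels come back in the order the config lists them.
    """
    if len(record) != len(lead_order):
        raise ValueError(
            f"expected {len(lead_order)} channels, got {len(record)}"
        )
    if config not in LEAD_CONFIGS:
        raise KeyError(f"unknown lead configuration: {config!r}")
    # Map each lead name to its row index in the record.
    index = {name: i for i, name in enumerate(lead_order)}
    return [record[index[name]] for name in LEAD_CONFIGS[config]]
```

For example, given a 12-channel record, `select_leads(record, "2-lead")` returns the Lead II and V5 channels in that order. How the model itself consumes a reduced set of channels is not described in the captions, so this sketch covers only the channel selection step.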