Opportunistic Cardiac Health Assessment: Estimating Phenotypes from Localizer MRI through Multi-Modal Representations

Busra Nur Zeybek, Özgün Turgut, Yundi Zhang, Jiazhen Pan, Robert Graf, Sophie Starck, Daniel Rueckert, Sevgi Gokce Kafali

Abstract

Cardiovascular diseases are the leading cause of death. Cardiac phenotypes (CPs), e.g., ejection fraction, are the gold standard for assessing cardiac health, but they are derived from cine cardiac magnetic resonance imaging (CMR), which is costly and requires high spatio-temporal resolution. Every magnetic resonance (MR) examination begins with rapid, coarse localizers for scan planning, which are discarded thereafter. Despite their non-diagnostic image quality and lack of temporal information, localizers can rapidly provide valuable structural information. In addition to imaging, patient-level information, including demographics and lifestyle, influences cardiac health assessment. Electrocardiograms (ECGs) are inexpensive, routinely ordered in clinical practice, and capture the temporal activity of the heart. Here, we introduce C-TRIP (Cardiac Tri-modal Representations for Imaging Phenotypes), a multi-modal framework that aligns localizer MRI, ECG signals, and tabular metadata to learn a robust latent space and predict CPs using localizer images as an opportunistic alternative to CMR. By combining these three modalities, we leverage inexpensive spatial and temporal information from localizers and ECGs, respectively, while benefiting from the patient-specific context provided by tabular data. Our pipeline consists of three stages. First, encoders are trained independently to learn uni-modal representations. The second stage fuses the pre-trained encoders to unify the latent space. The final stage uses the enriched representation space for CP prediction, with inference performed exclusively on localizer MRI. The proposed C-TRIP yields accurate functional CPs and high correlations for structural CPs. Since localizers are inherently rapid and low-cost, our C-TRIP framework could make CP estimation more accessible.
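
As an illustration of the second stage, the alignment of same-subject representations across modalities can be sketched with a symmetric InfoNCE (CLIP-style) objective. This is a minimal sketch under stated assumptions, not the C-TRIP implementation: the abstract does not specify the loss, and the function names `l2_normalize` and `info_nce`, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchor, other, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of `anchor` (e.g., localizer embeddings) and row i of `other`
    (e.g., ECG or tabular embeddings) are assumed to come from the same
    subject. The loss form and temperature are assumptions, not taken
    from the paper.
    """
    a = [l2_normalize(v) for v in anchor]
    b = [l2_normalize(v) for v in other]
    n = len(a)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def diagonal_cross_entropy(matrix):
        # Cross-entropy where the matching subject (the diagonal) is the target.
        total = 0.0
        for i, row in enumerate(matrix):
            row_max = max(row)
            log_norm = row_max + math.log(sum(math.exp(v - row_max) for v in row))
            total += log_norm - row[i]
        return total / len(matrix)

    transposed = [list(col) for col in zip(*logits)]
    # Average the anchor-to-other and other-to-anchor directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


# Toy batch of three subjects with 2-D embeddings (hypothetical values).
localizer = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
ecg_matched = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
ecg_mismatched = [[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]]

loss_matched = info_nce(localizer, ecg_matched)
loss_mismatched = info_nce(localizer, ecg_mismatched)
```

In this toy batch the loss is lower when each localizer embedding is paired with its own subject's ECG embedding than when the pairs are mismatched. Minimizing such a loss pulls same-subject representations together and pushes different subjects apart.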

Paper Structure (12 sections, 2 equations, 3 figures, 2 tables)


Figures (3)

  • Figure 1: Our multi-modal contrastive learning framework. The first stage pre-trains an encoder for each modality. The second stage unifies the latent spaces of the three modalities: representations from the same subject are pulled together, and those from different subjects are pushed apart. The last stage uses a regression head to predict all 18 cardiac phenotypes solely from localizer MRIs.
  • Figure 2: A) Attention maps. Maps represent the [CLS] token's self-attention weights from the final ViT block, head-averaged and up-sampled. While the supervised localizer baseline, $L_{sup}$, exhibits leaked attention to non-cardiac structures (e.g., lungs and chest wall), C-TRIP focuses on biologically relevant structures. Images are cropped to the heart region for visibility. B) UMAPs showing the representation space before and after alignment. Our localizer-centric design is reflected in the UMAPs. Due to their strong anatomical correlation, $L$ and $T$ merge easily. The relationship between $L$ and $E$ is more complex, so $E$ remains more distinct but is pulled toward the same phenotypic trend (e.g., from low to high LVM, or the male/female distinction).
  • Figure 3: Performance and scaling comparison of our proposed multi-modal approach against uni-modal supervised baselines across four CPs. (Top Row) Scatter plots comparing model predictions to ground-truth values using 100% of the available fine-tuning data for LVEF, LVM, RVEF, and RVEDV. Pearson $R$ is reported in the legends. The plots show 500 subjects for better visibility. (Bottom Row) Fine-tuning data scaling behavior. The plots show model performance (Pearson $R$) as a function of the fine-tuning subset size (1%, 10%, and 100%, displayed on a logarithmic scale). The shaded regions represent the 95% confidence intervals. Our proposed method, C-TRIP (in green), is compared against single-modality baselines.
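
The Pearson $R$ reported in Figure 3 can be computed from paired predictions and ground-truth values as sketched below. The captions do not state how the 95% confidence intervals were obtained, so the Fisher z-transform interval used here is an assumption, and the helper names `pearson_r` and `fisher_ci` and the toy values are hypothetical.

```python
import math


def pearson_r(pred, truth):
    """Pearson correlation coefficient between predictions and ground truth."""
    n = len(pred)
    mean_pred = sum(pred) / n
    mean_truth = sum(truth) / n
    cov = sum((p - mean_pred) * (t - mean_truth) for p, t in zip(pred, truth))
    var_pred = sum((p - mean_pred) ** 2 for p in pred)
    var_truth = sum((t - mean_truth) ** 2 for t in truth)
    return cov / math.sqrt(var_pred * var_truth)


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a correlation via the Fisher z-transform.

    Assumption: the paper may use a different procedure (e.g., bootstrapping
    or repeated runs). Requires n > 3 and |r| < 1.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Toy ejection-fraction values in percent (hypothetical, for illustration only).
truth = [55.0, 60.0, 48.0, 62.0, 58.0, 51.0]
pred = [54.0, 61.0, 50.0, 60.0, 57.0, 53.0]

r = pearson_r(pred, truth)
ci_low, ci_high = fisher_ci(r, len(truth))
```

With only a handful of subjects the interval is wide. It narrows as the number of evaluated subjects grows, since the standard error scales as $1/\sqrt{n-3}$.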