EchoFM: Foundation Model for Generalizable Echocardiogram Analysis

Sekeun Kim, Pengfei Jin, Sifan Song, Cheng Chen, Yiwei Li, Hui Ren, Xiang Li, Tianming Liu, Quanzheng Li

TL;DR

EchoFM addresses the need for a generalizable echocardiography backbone by combining a spatio-temporally masked autoencoder with periodic contrastive learning to capture the cyclic nature of cardiac motion. It pretrains on a massive, multi-center echocardiography corpus (over $290{,}000$ videos, up to $20$ million frames) across $26$ scan views and multiple imaging modes, then fine-tunes adapters for downstream tasks. The framework integrates a high masking ratio with Uniform-Frame Masking and Spatio-temporal Consistent Masking, plus a temporal self-similarity based triplet loss, yielding a total loss $L_{total}=L_r+L_c$. Across view identification, chamber segmentation, and disease severity tasks (AS/AR), EchoFM consistently outperforms state-of-the-art methods, demonstrating strong generalization to unseen datasets and clinical workflows with notable data-efficiency and robustness to domain shifts. This approach promises to enhance echocardiography analysis by providing a robust, adaptable backbone for diverse clinical applications.
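
As a rough illustration of how the total loss $L_{total}=L_r+L_c$ could be assembled, the sketch below builds a temporal self-similarity matrix from per-frame tokens and evaluates a triplet term on it. It is a minimal sketch under stated assumptions, not the authors' implementation: cosine similarity, the hinge form with `margin=0.2`, mean-squared error over masked patches for $L_r$, and every function name are hypothetical choices that this summary does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def self_similarity(tokens):
    """Temporal self-similarity matrix: entry [i][j] compares frame i with frame j."""
    return [[cosine(u, v) for v in tokens] for u in tokens]


def triplet_loss(sim, anchor, positive, negative, margin=0.2):
    """Hinge triplet term on similarities (assumed form, assumed margin).

    The loss is zero once the anchor is more similar to the positive
    than to the negative by at least `margin`.
    """
    return max(0.0, sim[anchor][negative] - sim[anchor][positive] + margin)


def reconstruction_loss(pred, target, masked):
    """Mean-squared error over masked patches only (assumed, as in standard MAE)."""
    errs = [(p - t) ** 2 for i in masked for p, t in zip(pred[i], target[i])]
    return sum(errs) / len(errs)


def total_loss(l_r, l_c):
    """L_total = L_r + L_c, as stated in the summary."""
    return l_r + l_c


# Toy periodic sequence: frames 0 and 2 share a cardiac phase, as do frames 1 and 3.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
sim = self_similarity(tokens)

l_c = triplet_loss(sim, anchor=0, positive=1, negative=2)  # badly ordered triplet
l_r = reconstruction_loss(
    pred=[[0.0, 0.0], [1.0, 1.0]],
    target=[[1.0, 1.0], [1.0, 1.0]],
    masked=[0],
)
loss = total_loss(l_r, l_c)
```

In this toy sequence, frames at the same phase have similarity 1, so a triplet that treats a same-phase frame as the positive incurs no penalty, while the reversed triplet used above is penalized.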

Abstract

Foundation models have recently gained significant attention because of their generalizability and adaptability across multiple tasks and data distributions. Although medical foundation models have emerged, solutions for cardiac imaging, especially echocardiography videos, remain unexplored. In this paper, we introduce EchoFM, a foundation model specifically designed to represent and analyze echocardiography videos. In EchoFM, we propose a self-supervised learning framework that captures both spatial and temporal variability patterns through a spatio-temporal consistent masking strategy and periodic-driven contrastive learning. This framework can effectively capture the spatio-temporal dynamics of echocardiography and learn representative video features without any labels. We pre-train our model on an extensive dataset comprising over 290,000 echocardiography videos covering 26 scan views across different imaging modes, with up to 20 million frames. The pre-trained EchoFM can then be easily adapted and fine-tuned for a variety of downstream tasks, serving as a robust backbone model. Our evaluation was systematically designed around four downstream tasks that follow the routine echocardiography examination workflow. Experimental results show that EchoFM surpasses state-of-the-art methods, including specialized echocardiography methods, self-supervised pre-training models, and general-purpose pre-trained foundation models, across all downstream tasks.

Paper Structure

This paper contains 26 sections, 4 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: (a) Key characteristics of B-mode echocardiography include its low signal-to-noise ratio and periodic temporal sequences. Unlike other imaging modalities, echocardiography features diverse scanning views and multiple imaging modes, enabling comprehensive cardiac assessment. (b) Typical downstream tasks in routine echocardiography include view identification and chamber segmentation, with a focus on disease diagnosis, such as assessing the severity of aortic stenosis and aortic regurgitation.
  • Figure 2: Overview of the proposed EchoFM. We extract spatio-temporal patches, keeping the same mask ratio across the temporal dimension. The visible patches are processed by the ViT encoder, and the extracted latent representations are grouped along the temporal dimension. The decoder reconstructs the missing patches of the input video. The temporally grouped spatio-temporal patches are processed by a ViT-based projector to extract [CLS] tokens independently. We build the temporal self-similarity matrix by computing the similarity between [CLS] tokens. We then sample triplets and apply Spatio-temporal Consistent Masking to the triplet patches. We minimize the periodic contrastive loss and the reconstruction loss until the network converges. For fine-tuning, the encoder is attached to a task-specific decoder and used for downstream tasks.
  • Figure 3: Different masking strategies: (a) Random masking, (b) Uniform-frame masking, which maintains the same mask ratio across the temporal dimension, and (c) Spatio-temporal consistent masking, which applies the same mask to selected samples.
  • Figure 4: (a) Quantitative comparison of echocardiography segmentation performance using the Dice metric. The Dice score for each trial is shown as a box-and-whisker plot spanning the minimum to maximum values. The p values indicate the statistically significant superiority of the proposed model. All statistical tests were two-sided. (b, c) Visual comparison of EchoFM and the second-best model on CAMUS (b) and the multi-center dataset (c).
  • Figure 5: Comparison of temporal self-similarity matrices produced by two methods: (a) VideoMAE and (b) the proposed EchoFM, illustrating the temporal self-similarity matrix within one and two cardiac cycles.
  • ...and 2 more figures
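
The three masking strategies named in the Figure 3 caption can be sketched as boolean masks over a grid of frames and patches. This is a minimal sketch, not the authors' code: the 0.75 ratio, the 4-frame by 16-patch grid, the function names, and the choice to build the shared mask from a uniform-frame mask are all assumptions made for illustration.

```python
import random


def random_mask(num_frames, patches_per_frame, ratio, rng):
    """(a) Random masking: draw masked tokens over the whole clip,
    so individual frames may end up with different mask ratios."""
    total = num_frames * patches_per_frame
    masked = set(rng.sample(range(total), int(round(total * ratio))))
    return [
        [(t * patches_per_frame + p) in masked for p in range(patches_per_frame)]
        for t in range(num_frames)
    ]


def uniform_frame_mask(num_frames, patches_per_frame, ratio, rng):
    """(b) Uniform-frame masking: every frame masks the same number of patches,
    so the mask ratio is constant across the temporal dimension."""
    k = int(round(patches_per_frame * ratio))
    mask = []
    for _ in range(num_frames):
        chosen = set(rng.sample(range(patches_per_frame), k))
        mask.append([p in chosen for p in range(patches_per_frame)])
    return mask


def consistent_mask(num_samples, num_frames, patches_per_frame, ratio, rng):
    """(c) Spatio-temporal consistent masking: one mask is drawn and then
    applied to every selected sample (for example, the clips of a triplet)."""
    shared = uniform_frame_mask(num_frames, patches_per_frame, ratio, rng)
    return [shared for _ in range(num_samples)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
uniform = uniform_frame_mask(4, 16, 0.75, rng)
rand = random_mask(4, 16, 0.75, rng)
triplet_masks = consistent_mask(3, 4, 16, 0.75, rng)
```

With these settings each frame of `uniform` masks exactly 12 of its 16 patches, `rand` masks 48 tokens in total without any per-frame guarantee, and the three entries of `triplet_masks` are identical.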