EchoFM: Foundation Model for Generalizable Echocardiogram Analysis
Sekeun Kim, Pengfei Jin, Sifan Song, Cheng Chen, Yiwei Li, Hui Ren, Xiang Li, Tianming Liu, Quanzheng Li
TL;DR
EchoFM addresses the need for a generalizable echocardiography backbone by combining a spatio-temporally masked autoencoder with periodic contrastive learning that captures the cyclic nature of cardiac motion. It is pre-trained on a large, multi-center echocardiography corpus (over $290{,}000$ videos, up to $20$ million frames) spanning $26$ scan views and multiple imaging modes, then fine-tunes adapters for downstream tasks. The framework pairs a high masking ratio with Uniform-Frame Masking and Spatio-temporal Consistent Masking, and adds a triplet loss based on temporal self-similarity, giving a total loss $L_{total}=L_r+L_c$ (reconstruction term $L_r$ plus contrastive term $L_c$). Across view identification, chamber segmentation, and disease severity tasks (AS/AR), EchoFM consistently outperforms state-of-the-art methods and generalizes to unseen datasets and clinical workflows, with notable data efficiency and robustness to domain shift. It is intended to serve as an adaptable backbone for diverse clinical echocardiography applications.
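To make the combined objective $L_{total}=L_r+L_c$ concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that $L_r$ is a mean squared error computed only over masked tokens and that $L_c$ is a standard margin-based triplet loss; the summary above does not give the exact form of either term, and the function names, array shapes, and `margin` value are hypothetical.

```python
import numpy as np


def masked_reconstruction_loss(pred, target, mask):
    """Assumed form of L_r: mean squared error over masked tokens only.

    pred, target: arrays of shape (num_tokens, dim).
    mask: boolean array of shape (num_tokens,), True where a token was masked.
    """
    per_token = ((pred - target) ** 2).mean(axis=1)
    return float(per_token[mask].mean())


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Assumed form of L_c: standard margin-based triplet loss.

    anchor, positive, negative: arrays of shape (batch, dim).
    margin is a hypothetical value, not taken from the paper.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())


def total_loss(pred, target, mask, anchor, positive, negative, margin=1.0):
    """L_total = L_r + L_c, an unweighted sum as stated in the summary."""
    l_r = masked_reconstruction_loss(pred, target, mask)
    l_c = triplet_loss(anchor, positive, negative, margin)
    return l_r + l_c
```

In this sketch the positives and negatives are passed in directly; how they are selected from the temporal self-similarity of a cardiac cycle is left out because the summary does not specify it.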
Abstract
Foundation models have recently gained significant attention because of their generalizability and adaptability across multiple tasks and data distributions. Although medical foundation models have emerged, solutions for cardiac imaging, especially echocardiography videos, remain unexplored. In this paper, we introduce EchoFM, a foundation model specifically designed to represent and analyze echocardiography videos. In EchoFM, we propose a self-supervised learning framework that captures both spatial and temporal variability patterns through a spatio-temporal consistent masking strategy and periodic-driven contrastive learning. This framework effectively captures the spatio-temporal dynamics of echocardiography and learns representative video features without any labels. We pre-train our model on an extensive dataset comprising over 290,000 echocardiography videos covering 26 scan views across different imaging modes, with up to 20 million frames. The pre-trained EchoFM can then be easily adapted and fine-tuned for a variety of downstream tasks, serving as a robust backbone model. Our evaluation was systematically designed around four downstream tasks that follow the routine echocardiography examination workflow. Experimental results show that EchoFM surpasses state-of-the-art methods, including specialized echocardiography methods, self-supervised pre-training models, and general-purpose pre-trained foundation models, across all downstream tasks.
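As an illustration of what a spatio-temporally consistent mask could look like, here is a small NumPy sketch. It rests on one plausible reading of the term, namely that the same spatial patch positions are masked in every frame of a clip (tube-style masking); the abstract does not define the strategy, so this is an assumption rather than the authors' method. The function name, the default `mask_ratio` of 0.75, and the patch-grid sizes are hypothetical.

```python
import numpy as np


def consistent_tube_mask(num_frames, num_patches, mask_ratio=0.75, seed=0):
    """Build a boolean mask of shape (num_frames, num_patches).

    Assumption: one spatial mask is sampled once and repeated for every
    frame, so each patch position is either masked or visible across the
    whole clip. True marks a masked patch.
    """
    rng = np.random.default_rng(seed)
    num_masked = int(round(mask_ratio * num_patches))
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    frame_mask = np.zeros(num_patches, dtype=bool)
    frame_mask[masked_idx] = True
    # Repeat the single-frame mask along the temporal axis.
    return np.tile(frame_mask, (num_frames, 1))


# Example: 8 frames, a 14 x 14 patch grid (196 patches per frame).
mask = consistent_tube_mask(num_frames=8, num_patches=196)
print(mask.shape, mask.sum(axis=1))
```

Because every frame receives the same number of masked patches, this sketch also satisfies a per-frame uniformity constraint, although whether that matches the paper's Uniform-Frame Masking is not stated here.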
