CardioBench: Do Echocardiography Foundation Models Generalize Beyond the Lab?
Darya Taratynova, Ahmed Aly, Numan Saeed, Mohammad Yaqub
TL;DR
CardioBench addresses the lack of a standardized benchmark for echocardiography foundation models by unifying eight public datasets into a single evaluation suite of four regression and five classification tasks covering functional, structural, diagnostic, and view recognition endpoints. The paper evaluates cardiac-specific, biomedical, and general-purpose encoders under zero-shot, probing, and alignment protocols. It finds that temporal modeling is crucial for ejection fraction (EF) regression, that retrieval-based methods are robust under distribution shift, and that domain-specific text encoders can ground physiologic axes such as EF. General-purpose encoders often transfer well and approach specialized models on some tasks, but struggle with fine-grained view classification and subtle pathologies, underscoring the value of hybrid designs and targeted supervision. By releasing preprocessing, splits, and public evaluation pipelines, CardioBench establishes a reproducible reference for fair comparison and offers practical guidance for developing the next generation of clinically meaningful echocardiography foundation models.
Abstract
Foundation models (FMs) are reshaping medical imaging, yet their application in echocardiography remains limited. While several echocardiography-specific FMs have recently been introduced, no standardized benchmark exists to evaluate them. Echocardiography poses unique challenges, including noisy acquisitions, high frame redundancy, and limited public datasets. Most existing models are evaluated on private data, which restricts comparability. To address this, we introduce CardioBench, a comprehensive benchmark for echocardiography FMs. CardioBench unifies eight publicly available datasets into a standardized suite spanning four regression and five classification tasks, covering functional, structural, diagnostic, and view recognition endpoints. We evaluate several leading FMs, including cardiac-specific, biomedical, and general-purpose encoders, under consistent zero-shot, probing, and alignment protocols. Our results highlight complementary strengths across model families: temporal modeling is critical for functional regression, retrieval provides robustness under distribution shift, and domain-specific text encoders capture physiologically meaningful axes. General-purpose encoders transfer strongly and often close the gap to specialized models when probed, but struggle with fine-grained distinctions such as view classification and subtle pathology recognition. By releasing preprocessing, splits, and public evaluation pipelines, CardioBench establishes a reproducible reference point and offers actionable insights to guide the design of future echocardiography foundation models.
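To make the retrieval idea concrete, the following is a minimal sketch of retrieval-based regression over frozen embeddings: a query embedding is compared against a labeled reference bank, and the prediction is the similarity-weighted mean label of its nearest neighbors. This is an illustration under stated assumptions, not the CardioBench pipeline. The function names `cosine_similarity` and `retrieve_ef`, the choice of cosine similarity, the weighting scheme, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_ef(query, bank_embeddings, bank_ef, k=3):
    """Predict EF as the similarity-weighted mean EF of the k most similar bank items.

    Assumption: embeddings come from a frozen encoder and bank_ef holds the
    matching ground-truth EF values (in percent). Both are hypothetical here.
    """
    if not bank_embeddings or len(bank_embeddings) != len(bank_ef):
        raise ValueError("bank must be non-empty with one EF label per embedding")
    if k < 1:
        raise ValueError("k must be at least 1")

    sims = [cosine_similarity(query, emb) for emb in bank_embeddings]
    # Indices of the k most similar reference items.
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]

    # Clip negative similarities so dissimilar neighbors do not get negative weight.
    weights = [max(sims[i], 0.0) for i in top]
    total = sum(weights)
    if total == 0.0:
        # Fall back to an unweighted mean when no neighbor is positively similar.
        return sum(bank_ef[i] for i in top) / len(top)
    return sum(w * bank_ef[i] for w, i in zip(weights, top)) / total


# Toy reference bank: three made-up embeddings with made-up EF labels.
bank = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ef_labels = [60.0, 30.0, 45.0]

nearest_only = retrieve_ef([1.0, 0.0], bank, ef_labels, k=1)
two_neighbors = retrieve_ef([1.0, 0.0], bank, ef_labels, k=2)
```

Because the encoder stays frozen and only the reference bank changes, a scheme like this can be re-targeted to a new dataset by swapping the bank, which is one plausible reading of why retrieval holds up under distribution shift.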
