CardioBench: Do Echocardiography Foundation Models Generalize Beyond the Lab?

Darya Taratynova, Ahmed Aly, Numan Saeed, Mohammad Yaqub

TL;DR

CardioBench addresses the lack of a standardized benchmark for echocardiography foundation models by unifying eight public datasets into a consolidated evaluation suite of four regression and five classification tasks that cover functional, structural, diagnostic, and view recognition endpoints. The paper assesses cardiac-specific, biomedical, and general-purpose encoders under zero-shot, probing, and alignment protocols, revealing that temporal modeling is crucial for EF regression, retrieval-based methods offer robustness under distribution shifts, and domain-specific text encoders can ground physiologic axes like EF. General-purpose encoders often transfer well and approach specialized models on some tasks, but struggle with fine-grained view classification and subtle pathologies, underscoring the value of hybrid designs and targeted supervision. By providing preprocessing, splits, and public evaluation pipelines, CardioBench establishes a reproducible reference for fair comparison and offers practical guidance for developing the next generation of clinically meaningful echocardiography foundation models.

Abstract

Foundation models (FMs) are reshaping medical imaging, yet their application in echocardiography remains limited. While several echocardiography-specific FMs have recently been introduced, no standardized benchmark exists to evaluate them. Echocardiography poses unique challenges, including noisy acquisitions, high frame redundancy, and limited public datasets. Most existing solutions evaluate on private data, restricting comparability. To address this, we introduce CardioBench, a comprehensive benchmark for echocardiography FMs. CardioBench unifies eight publicly available datasets into a standardized suite spanning four regression and five classification tasks, covering functional, structural, diagnostic, and view recognition endpoints. We evaluate several leading FMs, including cardiac-specific, biomedical, and general-purpose encoders, under consistent zero-shot, probing, and alignment protocols. Our results highlight complementary strengths across model families: temporal modeling is critical for functional regression, retrieval provides robustness under distribution shift, and domain-specific text encoders capture physiologically meaningful axes. General-purpose encoders transfer strongly and often close the gap with probing, but struggle with fine-grained distinctions like view classification and subtle pathology recognition. By releasing preprocessing, splits, and public evaluation pipelines, CardioBench establishes a reproducible reference point and offers actionable insights to guide the design of future echocardiography foundation models.

Paper Structure

This paper contains 33 sections, 21 figures, 19 tables.

Figures (21)

  • Figure 1: CardioBench is a standardized benchmark unifying 8 datasets, covering 4 regression tasks and 5 classification tasks across multi-view echocardiography.
  • Figure 2: Left: frame-level cosine similarity matrices for natural video frames from the SumMe dataset (gygli2014creating) versus echocardiography video frames, both extracted using SigLIP2 (tschannen2025siglip). Echocardiography videos exhibit much higher frame-to-frame similarity than natural videos, making informative feature extraction more challenging. Right: the number of echocardiography foundation models released each year; by mid-2025, 8 models had been published.
  • Figure 3: Absolute EF error distributions across demographic subgroups in CAMUS (a–c) and EchoNet-Pediatric (d–f).
  • Figure 4: Top row: EF text prompt embeddings projected into 2D. Rows 2–4: alignment of visual embeddings with the EF text axis for each dataset.
  • Figure 5: Left: Radar plots of view classification accuracy across datasets. Right: UMAP projection of TMED-2 embeddings with KNN probing results.
  • ...and 16 more figures