Benchmarking ECG FMs: A Reality Check Across Clinical Tasks

M A Al-Masud, Juan Miguel Lopez Alcaraz, Nils Strodthoff

TL;DR

Foundation models (FMs) show promise for adult ECG analysis, but substantial gaps remain in cardiac structure, outcome prediction, and patient characterization. The strong results of the compact ECG-CPC model suggest that architecture matters more than scale.

Abstract

The 12-lead electrocardiogram (ECG) is a long-standing diagnostic tool. Yet machine learning for ECG interpretation remains fragmented, often limited to narrow tasks or datasets. Foundation models (FMs) promise broader adaptability, but fundamental questions remain: Which architectures generalize best? How do models scale with limited labels? What explains performance differences across model families? We benchmarked eight ECG FMs on 26 clinically relevant tasks using 12 public datasets comprising 1,650 regression and classification targets. Models were evaluated under fine-tuning and frozen settings, with scaling analyses across dataset sizes. Results show heterogeneous performance across domains: in adult ECG interpretation, three FMs consistently outperformed strong supervised baselines. In contrast, ECG-CPC, a compact structured state-space model, dominated 5 of 7 task categories, demonstrating that architecture matters more than scale. FMs improved label efficiency 3.3-9x over supervised baselines, though scaling behaviors varied across architectures. Representation analysis reveals that models with similar performance learn markedly different internal structures, suggesting multiple viable paths to effective ECG representation. Overall, while FMs show promise for adult ECG analysis, substantial gaps remain in cardiac structure, outcome prediction, and patient characterization. ECG-CPC's strong performance despite being orders of magnitude smaller challenges the assumption that FM quality requires massive scale, highlighting architectural inductive biases as an untapped opportunity.

Paper Structure

This paper contains 36 sections, 6 figures, and 34 tables.

Figures (6)

  • Figure 1: Overview of the benchmarking pipeline for ECG FMs.
  • Figure 2: Radar plots summarizing model performance ranks (lower rank indicates statistically significantly better performance) for the eight FMs across the 7 investigated tasks. Our ranking criterion accounts for confidence interval overlaps within each of the tasks and datasets. The plot is based on data from Table \ref{tab:median_ranking_table}.
  • Figure 3: Scaling with dataset size on EchoNext across the best-performing FMs and supervised baseline.
  • Figure 4: Intra-model layer-wise representation similarity analysis. CKA heatmaps comparing all internal layers within each of the best-performing FMs on the PTB-XL (all) dataset. Higher values (yellow) indicate similar representations between layers. CKA was computed using a Gaussian RBF kernel ($\sigma=1.0$) on 2,500 samples per model.
  • Figure 5: Inter-model representational similarity across network depths. CKA heatmaps comparing corresponding stages across four FMs (ECGFounder, ECG-JEPA, ST-MEM, ECG-CPC). Higher values (yellow) indicate similar representations between layers. CKA was computed using a Gaussian RBF kernel ($\sigma=1.0$) on 2,500 samples of the PTB-XL (all) dataset per model.
  • ...and 1 more figure
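
The captions for Figures 4 and 5 mention CKA (centered kernel alignment) with a Gaussian RBF kernel. As a rough illustration of that similarity measure, here is a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, and it assumes $\sigma$ is applied directly to Euclidean distances between activation vectors, whereas some CKA implementations instead scale the bandwidth by the median pairwise distance.

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    """Gaussian RBF Gram matrix for activations x of shape (n_samples, n_features)."""
    sq_norms = np.sum(x * x, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip small negatives from round-off
    # Assumption: sigma is an absolute bandwidth, not a fraction of the median distance.
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def center_gram(k):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 11^T."""
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ k @ h

def kernel_cka(x, y, sigma=1.0):
    """Kernel CKA between two activation matrices computed on the same n samples."""
    kc = center_gram(rbf_gram(x, sigma))
    lc = center_gram(rbf_gram(y, sigma))
    hsic_xy = np.sum(kc * lc)  # equals trace(kc @ lc) for symmetric matrices
    norm = np.sqrt(np.sum(kc * kc) * np.sum(lc * lc))
    return hsic_xy / norm
```

To build a heatmap like those described, one would evaluate `kernel_cka` for every pair of layers (within one model or across two models) on the same set of input samples. Note that the Gram matrices are n × n, so memory grows quadratically with the number of samples.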