ECG-Soup: Harnessing Multi-Layer Synergy for ECG Foundation Models

Phu X. Nguyen, Huy Phan, Hieu Pham, Christos Chatzichristos, Bert Vandenberk, Maarten De Vos

TL;DR

ECG-Soup investigates how intermediate layers of pretrained 1-D Vision Transformers encode ECG information and how to fuse multi-layer representations for robust downstream classification. The authors introduce three cross-layer aggregation schemes (PPA, PMA, IPASTMEM) built on a Spatio-Temporal Masked Electrocardiogram Modeling backbone (STMEM) and provide theoretical insights into attention dynamics. Empirical results across multiple ECG datasets show that middle layers offer richer, more generalizable features, and PMA and IPASTMEM consistently outperform baselines in both in-distribution and out-of-distribution settings. The work highlights the practical impact of multi-layer representation fusion for ECG foundation models and points to future multimodal extensions for zero-shot learning in biomedical tasks.
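As a rough illustration of what fusing multi-layer representations can look like, the sketch below mean-pools the patch tokens of each layer and combines the pooled vectors with softmax-normalized layer weights. This is a generic weighted-average scheme under stated assumptions, not the authors' PPA, PMA, or IPASTMEM; the names `fuse_layers`, `layer_tokens`, and `layer_scores` are hypothetical, and in practice the scores would be learned on the downstream task rather than fixed by hand.

```python
import math


def softmax(scores):
    """Turn arbitrary per-layer scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(tokens):
    """Average a list of equal-length token vectors into one vector."""
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


def fuse_layers(layer_tokens, layer_scores):
    """Weighted sum of mean-pooled per-layer representations.

    layer_tokens: one entry per layer, each a list of token vectors
                  (hypothetical stand-in for a ViT's hidden states).
    layer_scores: one scalar per layer (assumed learnable; fixed here).
    """
    weights = softmax(layer_scores)
    pooled = [mean_pool(tokens) for tokens in layer_tokens]
    dim = len(pooled[0])
    return [sum(w * p[d] for w, p in zip(weights, pooled)) for d in range(dim)]


# Toy input: 3 layers, 2 tokens per layer, embedding dimension 2.
layer_tokens = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
# Equal scores give every layer the same weight (1/3 each).
fused = fuse_layers(layer_tokens, [0.0, 0.0, 0.0])
```

Raising the score of a middle layer shifts the fused vector toward that layer's pooled representation, which is the kind of emphasis the TL;DR's finding about middle layers would motivate.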

Abstract

Transformer-based foundation models for Electrocardiograms (ECGs) have recently achieved impressive performance in many downstream applications.

Paper Structure

This paper contains 30 sections, 38 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Representation analysis across layers of STMEM-based pretrained ViT on the PTB-XL dataset, a large publicly available clinical 12-lead ECG database.
  • Figure 2: Overview of post-pretraining mixture-of-layers aggregation of pretrained ViT's different layer-wise representations.
  • Figure 3: Average cosine similarity through inner layers of pretrained ViT models.
  • Figure 4: Cosine similarity maps. A 12-lead ECG provides complete spatial and temporal information about the heart through the precordial leads (V1-V6) and the limb leads (I, II, III, AVR, AVL, AVF). The figure shows cosine similarity maps between a query patch (red dashed box) in lead V2 and the remaining patches for two pretrained models: (a) STMEM and (b) IPASTMEM.
  • Figure 5: Average attention entropy through inner layers of pretrained ViT models.
  • ...and 4 more figures
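
The captions for Figures 3 and 5 refer to average cosine similarity and average attention entropy per layer. The sketch below shows one common way such quantities are computed for a single layer: the mean cosine similarity over all pairs of token vectors, and the mean Shannon entropy over the rows of an attention matrix. The exact definitions used in the paper are not given here, so this is an assumption, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_pairwise_cosine(tokens):
    """Mean cosine similarity over all distinct token pairs in one layer."""
    pairs = [(i, j) for i in range(len(tokens)) for j in range(i + 1, len(tokens))]
    return sum(cosine_similarity(tokens[i], tokens[j]) for i, j in pairs) / len(pairs)


def attention_entropy(attn_row):
    """Shannon entropy (in nats) of one query's attention distribution."""
    return -sum(p * math.log(p) for p in attn_row if p > 0.0)


def average_attention_entropy(attn):
    """Mean entropy over the rows of an attention matrix (rows sum to 1)."""
    return sum(attention_entropy(row) for row in attn) / len(attn)


# Identical tokens are maximally similar; orthogonal tokens are not similar.
same = average_pairwise_cosine([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
orthogonal = average_pairwise_cosine([[1.0, 0.0], [0.0, 1.0]])

# Uniform attention has the highest entropy; one-hot attention has zero.
uniform = average_attention_entropy([[0.25, 0.25, 0.25, 0.25]])
peaked = average_attention_entropy([[1.0, 0.0, 0.0, 0.0]])
```

Under these definitions, a layer whose tokens have collapsed toward one another shows high average cosine similarity, and a layer whose heads attend broadly shows high average entropy.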