Self-Supervised Pre-Training with Joint-Embedding Predictive Architecture Boosts ECG Classification Performance

Kuba Weimann, Tim O. F. Conrad

TL;DR

This work explores the joint-embedding predictive architecture (JEPA) for self-supervised learning from ECG data and shows that JEPA outperforms existing invariance-based and generative approaches and remains advantageous for pre-training even in the absence of additional data.

Abstract

Accurate diagnosis of heart arrhythmias requires the interpretation of electrocardiograms (ECG), which capture the electrical activity of the heart. Automating this process through machine learning is challenging due to the need for large annotated datasets, which are difficult and costly to collect. To address this issue, transfer learning is often employed, where models are pre-trained on large datasets and fine-tuned for specific ECG classification tasks with limited labeled data. Self-supervised learning has become a widely adopted pre-training method, enabling models to learn meaningful representations from unlabeled datasets. In this work, we explore the joint-embedding predictive architecture (JEPA) for self-supervised learning from ECG data. Unlike invariance-based methods, JEPA does not rely on hand-crafted data augmentations, and unlike generative methods, it predicts latent features rather than reconstructing input data. We create a large unsupervised pre-training dataset by combining ten public ECG databases, amounting to over one million records. We pre-train Vision Transformers using JEPA on this dataset and fine-tune them on various PTB-XL benchmarks. Our results show that JEPA outperforms existing invariance-based and generative approaches, achieving an AUC of 0.945 on the PTB-XL all statements task. JEPA consistently learns the highest quality representations, as demonstrated in linear evaluations, and proves advantageous for pre-training even in the absence of additional data.

Paper Structure

This paper contains 31 sections, 1 equation, 2 figures, 10 tables.

Figures (2)

  • Figure 1: Model overview. A 12-lead ECG $y$ (a) undergoes preprocessing and is divided into patches. Contiguous blocks of patches from $y$ are randomly masked, and the remaining patches are used to form an ECG $x$. Both ECGs are processed by the joint-embedding predictive architecture (JEPA; Assran et al., 2023) for feature prediction in latent space (b). The y-encoder embeds the patches of $y$ to generate target embeddings. Simultaneously, the x-encoder processes the unmasked patches in $x$, which, along with mask tokens indicating the positions of the masked patches, are fed to the predictor. The predictor then outputs patch-level predictions for the masked patches. The model minimizes the $L_1$ reconstruction loss between the target and predicted embeddings. To prevent model collapse, the y-encoder is not directly trained, but rather updated using an exponential moving average of the x-encoder's weights.
  • Figure 2: Validation performance of ViT-S on PTB-XL downstream tasks. Higher proficiency at the pre-training objective does not necessarily translate to improved validation performance on downstream tasks.
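
The training step described in the Figure 1 caption can be sketched in a few lines of NumPy. This is only an illustrative toy, not the authors' implementation: the encoders and predictor are single linear maps standing in for the Vision Transformers, no gradient step is shown, and every name and number here (`sample_block_mask`, `ema_update`, the patch counts, block sizes, and the momentum of 0.996) is an assumption chosen for the example.

```python
import numpy as np

def sample_block_mask(num_patches, num_blocks, block_len, rng):
    """Mask contiguous blocks of patches; True marks a masked patch."""
    mask = np.zeros(num_patches, dtype=bool)
    for _ in range(num_blocks):
        start = rng.integers(0, num_patches - block_len + 1)
        mask[start:start + block_len] = True  # blocks may overlap
    return mask

def l1_loss(pred, target):
    """Mean absolute error between predicted and target embeddings."""
    return float(np.abs(pred - target).mean())

def ema_update(target_params, online_params, momentum):
    """Move the y-encoder weights toward the x-encoder weights."""
    return {name: momentum * target_params[name]
                  + (1.0 - momentum) * online_params[name]
            for name in target_params}

rng = np.random.default_rng(0)
num_patches, patch_dim, embed_dim = 50, 25, 16  # hypothetical sizes

# Random stand-in for a preprocessed, patchified ECG y.
patches_y = rng.standard_normal((num_patches, patch_dim))

# Toy linear "encoders"; the y-encoder starts as a copy of the x-encoder.
x_encoder = {"W": rng.standard_normal((patch_dim, embed_dim)) * 0.1}
y_encoder = {name: w.copy() for name, w in x_encoder.items()}
pos_embed = rng.standard_normal((num_patches, embed_dim)) * 0.1
mask_token = np.zeros(embed_dim)
W_pred = rng.standard_normal((embed_dim, embed_dim)) * 0.1

mask = sample_block_mask(num_patches, num_blocks=4, block_len=5, rng=rng)

# The y-encoder embeds all patches of y to produce target embeddings.
targets = patches_y @ y_encoder["W"] + pos_embed

# The x-encoder only sees the unmasked patches (the ECG x).
context = patches_y[~mask] @ x_encoder["W"] + pos_embed[~mask]

# Toy predictor: mask token + position of each masked patch + pooled context.
queries = mask_token + pos_embed[mask] + context.mean(axis=0)
predictions = queries @ W_pred

# L1 loss in latent space, computed on the masked positions only.
loss = l1_loss(predictions, targets[mask])

# The y-encoder receives no gradients; it tracks the x-encoder via EMA.
y_encoder = ema_update(y_encoder, x_encoder, momentum=0.996)
```

In a real training loop, the loss would be backpropagated through the predictor and the x-encoder only, and the EMA update would run after each optimizer step.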