
Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals

Aruna Mohan, Danne Elbers, Or Zilbershot, Fatemeh Afghah, David Vorchheimer

TL;DR

This work addresses the need for explainable atrial fibrillation detection from single-lead ECG by comparing a Vision Transformer (ViT) approach with a ResNet baseline on RRR heartbeat segments from the Chapman–Shaoxing dataset. Each RRR segment spans three consecutive R-peaks and therefore contains a complete heartbeat; both non-normalized and $z$-normalized versions of the signal are tested. ResNet achieves >96% accuracy, while ViT reaches approximately 92–93% accuracy with non-normalized inputs and provides explainability through attention maps that highlight P-wave and T-wave regions. The results indicate that segment length and amplitude are informative features, and that the attention and Grad-CAM visualizations align with clinical waveform characteristics, giving a transparent rationale for each classification. The ViT approach offers faster inference and potential deployment on wearables, though it currently requires larger datasets to realize its full performance potential. Future work aims to scale ViT with more data and to extend the explainability methods to additional cardiac conditions.

Abstract

Remote patient monitoring based on wearable single-lead electrocardiogram (ECG) devices has significant potential for enabling the early detection of heart disease, especially in combination with artificial intelligence (AI) approaches for automated heart disease detection. There have been prior studies applying AI approaches based on deep learning for heart disease detection. However, these models are yet to be widely accepted as a reliable aid for clinical diagnostics, in part due to the current black-box perception surrounding many AI algorithms. In particular, there is a need to identify the key features of the ECG signal that contribute toward making an accurate diagnosis, thereby enhancing the interpretability of the model. In the present study, we develop a vision transformer approach to identify atrial fibrillation based on single-lead ECG data. A residual network (ResNet) approach is also developed for comparison with the vision transformer approach. These models are applied to the Chapman-Shaoxing dataset to classify atrial fibrillation, as well as another common arrhythmia, sinus bradycardia, and normal sinus rhythm heartbeats. The models enable the identification of the key regions of the heartbeat that determine the resulting classification, and highlight the importance of P-waves and T-waves, as well as heartbeat duration and signal amplitude, in distinguishing normal sinus rhythm from atrial fibrillation and sinus bradycardia.
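The RRR segmentation and $z$-normalization mentioned in the TL;DR can be illustrated with a minimal sketch. The function names, the stride of one beat between successive segments, and the inclusion of both outer R-peaks in each slice are assumptions made for illustration. R-peak locations are assumed to be already available from a separate detector; none of this is taken from the paper's code.

```python
import numpy as np

def extract_rrr_segments(signal, r_peaks):
    """Slice a 1-D ECG into segments spanning three consecutive R-peaks.

    Assumption: segment i runs from R-peak i to R-peak i+2 (both included),
    so consecutive segments overlap by one R-R interval.
    """
    signal = np.asarray(signal, dtype=float)
    r_peaks = np.asarray(r_peaks, dtype=int)
    return [signal[r_peaks[i]: r_peaks[i + 2] + 1]
            for i in range(len(r_peaks) - 2)]

def z_normalize(segment, eps=1e-8):
    """Rescale one segment to zero mean and (approximately) unit variance."""
    segment = np.asarray(segment, dtype=float)
    # eps guards against division by zero on a flat segment.
    return (segment - segment.mean()) / (segment.std() + eps)

# Synthetic stand-in for an ECG trace with hypothetical R-peak indices.
ecg = np.sin(np.linspace(0.0, 20.0, 100))
segments = extract_rrr_segments(ecg, [10, 30, 55, 75])
normalized = [z_normalize(s) for s in segments]
```

Note that $z$-normalizing each segment discards its absolute amplitude while leaving its length unchanged, which is one way to probe how much the amplitude itself contributes to classification.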

Paper Structure

This paper contains 6 sections, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Histogram of RRR segment lengths for AFIB, SB and SR. The $x$-axis represents the segment length, and the $y$-axis represents the percentage in each bin.
  • Figure 2: Illustration of the ViT Approach. A detailed diagram of the transformer encoder is presented in Dosovitskiy et al. (2021).
  • Figure 3: Attention heatmaps from the ViT model using 2 attention heads, based on correctly classified AFIB cases. The blue line represents the average signal amplitude, with the gray region corresponding to $\pm$ 1 standard deviation. The $y$-axis represents the amplitude in $\mu$V, while the $x$-axis represents the index.
  • Figure 4: Attention heatmaps from the ViT model using 2 attention heads, based on correctly classified SB cases. The blue line represents the average signal amplitude, with the gray region corresponding to $\pm$ 1 standard deviation. The $y$-axis represents the amplitude in $\mu$V, while the $x$-axis represents the index.
  • Figure 5: Attention heatmaps from the ViT model using 2 attention heads, based on correctly classified SR cases. The blue line represents the average signal amplitude, with the gray region corresponding to $\pm$ 1 standard deviation. The $y$-axis represents the amplitude in $\mu$V, while the $x$-axis represents the index.
  • ...and 1 more figure
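As a rough illustration of the kind of plot described in the Figure 1 caption, the sketch below bins segment lengths and reports the percentage of segments falling in each bin. The function name, bin edges, and example lengths are hypothetical and are not values from the paper.

```python
import numpy as np

def percent_histogram(lengths, bin_edges):
    """Return the percentage of segments in each bin, plus the bin edges."""
    counts, edges = np.histogram(lengths, bins=bin_edges)
    total = counts.sum()
    if total == 0:
        # No segment fell inside the bins: report all-zero percentages.
        return np.zeros(len(counts), dtype=float), edges
    return 100.0 * counts / total, edges

# Made-up segment lengths (in samples) for a single class.
example_lengths = [100, 120, 130, 250]
percent, edges = percent_histogram(example_lengths, [0, 200, 400])
```

Computing the percentages separately for each class (AFIB, SB, SR) makes the distributions comparable even when the classes contain different numbers of segments.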