
Stress Classification from ECG Signals Using Vision Transformer

Zeeshan Ahmad, Naimul Khan

Abstract

Vision Transformers have shown tremendous success in numerous computer vision applications; however, they have not been exploited for stress assessment using physiological signals such as the electrocardiogram (ECG). To get the maximum benefit from the vision transformer for multilevel stress assessment, in this paper we transform raw ECG data into 2D spectrograms using the short-time Fourier transform (STFT). These spectrograms are divided into patches that are fed to the transformer encoder. We also perform experiments with a 1D CNN and ResNet-18 (a CNN model). We run leave-one-subject-out cross-validation (LOSOCV) experiments on the WESAD and Ryerson Multimedia Lab (RML) datasets. One of the biggest challenges in LOSOCV-based experiments is tackling intersubject variability. In this research, we address intersubject variability and show that 2D spectrograms combined with the attention mechanism of the transformer mitigate it. Experiments show that the vision transformer handles the effect of intersubject variability much better than CNN-based models and beats all previous state-of-the-art methods by a considerable margin. Moreover, our method is end-to-end, does not require handcrafted features, and learns robust representations. The proposed method achieves accuracies of 71.01% on the RML dataset and 76.7% on the WESAD dataset for three-class classification, and 88.3% for binary classification on WESAD.
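A minimal sketch of the ECG-to-spectrogram step described above, assuming SciPy's `stft` and illustrative window parameters (the abstract does not specify the window length or overlap; WESAD's chest ECG is sampled at 700 Hz):

```python
import numpy as np
from scipy import signal

fs = 700                               # WESAD chest ECG sampling rate (Hz)
ecg = np.random.randn(fs * 60)         # placeholder: a 60-second ECG segment

# STFT -> complex time-frequency matrix; nperseg/noverlap are illustrative
f, t, Zxx = signal.stft(ecg, fs=fs, nperseg=256, noverlap=128)
spectrogram = np.log1p(np.abs(Zxx))    # log-magnitude spectrogram

# `spectrogram` has shape (freq_bins, time_frames); rendered as a 2D image,
# it becomes the input that is split into patches for the vision transformer.
print(spectrogram.shape)
```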

Paper Structure

This paper contains 19 sections, 5 equations, 7 figures, and 10 tables.

Figures (7)

  • Figure 1: Overview of the proposed method. We transform an ECG signal into a spectrogram, which is then split into flattened patches; position embeddings are added, and the sequence is fed to the transformer encoder. Stress recognition is performed by adding an extra learnable “classification token” to the sequence (a minimal sketch of this pipeline follows the figure list).
  • Figure 2: Spectrograms representing low, medium, and high stress levels from the RML dataset. Each spectrogram displays the time–frequency distribution of the ECG signal.
  • Figure 3: Visualization of features in the spectrogram.
  • Figure 4: One encoder block of the Vision Transformer. This block is repeated $L$ times in the full encoder, as indicated by $L\times$ at the top.
  • Figure 5: Visualization of attention maps for three stress levels. (a) Low stress, (b) Medium stress, and (c) High stress, each showing the spectrogram with attention maps extracted from the 1st, 5th, and 10th encoder layers.
  • ...and 2 more figures
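To make the patch-and-token pipeline of Figure 1 concrete, here is a minimal PyTorch sketch. The sizes (224×224 input, 16×16 patches, 768-dim embeddings, 12 heads/layers) are common ViT defaults assumed for illustration, not necessarily the paper's exact configuration:

```python
import torch
import torch.nn as nn

img_size, patch, dim = 224, 16, 768
n_patches = (img_size // patch) ** 2                 # 196 patches

# A strided convolution implements "split into flattened patches +
# linear projection" in a single step.
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
cls_token = nn.Parameter(torch.zeros(1, 1, dim))     # learnable [class] token
pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
    num_layers=12)                                   # block repeated L=12 times
head = nn.Linear(dim, 3)                             # 3 stress levels

x = torch.randn(1, 3, img_size, img_size)            # spectrogram as an image
patches = to_patches(x).flatten(2).transpose(1, 2)   # (1, 196, 768)
tokens = torch.cat([cls_token.expand(1, -1, -1), patches], dim=1) + pos_embed
logits = head(encoder(tokens)[:, 0])                 # classify via [class] token
```

As in Figure 1, classification reads only the final embedding of the prepended [class] token; the remaining patch tokens serve as context through self-attention.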