SWIM: Short-Window CNN Integrated with Mamba for EEG-Based Auditory Spatial Attention Decoding

Ziyang Zhang, Andrew Thwaites, Alexandra Woolgar, Brian Moore, Chao Zhang

TL;DR

A new model named SWIM, a short-window convolutional neural network integrated with Mamba, is proposed for identifying the locus of auditory attention (left or right) from electroencephalography (EEG) signals without relying on speech envelopes.

Abstract

In complex auditory environments, the human auditory system can focus on a specific speaker while disregarding others. In this study, a new model named SWIM, a short-window convolutional neural network (CNN) integrated with Mamba, is proposed for identifying the locus of auditory attention (left or right) from electroencephalography (EEG) signals without relying on speech envelopes. SWIM consists of two parts. The first is a short-window CNN (SW$_\text{CNN}$), which acts as a short-term EEG feature extractor and achieves a final accuracy of 84.9% in the leave-one-speaker-out setup on the widely used KUL dataset. This improvement is due to the use of an improved CNN structure, data augmentation, multitask training, and model combination. The second part, Mamba, is a sequence model applied here for the first time to auditory spatial attention decoding; it leverages long-term dependencies across previous SW$_\text{CNN}$ time steps. By jointly training SW$_\text{CNN}$ and Mamba, the proposed SWIM structure uses both short-term and long-term information and achieves an accuracy of 86.2%, which reduces classification errors by a relative 31.0% compared to the previous state-of-the-art result. The source code is available at https://github.com/windowso/SWIM-ASAD.
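The relative error reduction quoted in the abstract can be unpacked with a short calculation. The sketch below is only an illustration: the abstract does not state the previous state-of-the-art accuracy, so the baseline it prints is inferred from the two reported figures (86.2% accuracy, 31.0% relative reduction) and is subject to rounding. The function name `implied_baseline_error` is hypothetical.

```python
def implied_baseline_error(new_accuracy: float, relative_reduction: float) -> float:
    """Back out the baseline error rate from a new accuracy and a relative error reduction.

    relative_reduction = (baseline_error - new_error) / baseline_error,
    so baseline_error = new_error / (1 - relative_reduction).
    """
    new_error = 1.0 - new_accuracy
    return new_error / (1.0 - relative_reduction)


# Figures reported in the abstract; the baseline itself is an inference, not a reported value.
baseline_error = implied_baseline_error(new_accuracy=0.862, relative_reduction=0.310)
baseline_accuracy = 1.0 - baseline_error
print(f"implied baseline error:    {baseline_error:.3f}")
print(f"implied baseline accuracy: {baseline_accuracy:.3f}")
```

Under these assumptions the error rate falls from roughly 20% to 13.8%, which is what a relative reduction of 31.0% means here.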

Paper Structure (19 sections, 2 equations, 7 figures, 2 tables)

This paper contains 19 sections, 2 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: The architecture of SW$_\text{CNN}$. The input is a decision window of an EEG signal with 64 channels and $T$ samples. The output is logits for classifying attention location and subject ID. In this figure, the number in front of the @ represents the model channel dimension, and the + represents the vector concatenation of two dimensions.
  • Figure 2: The architecture of SWIM. The SW$_\text{CNN}$ is the one shown in Fig. 1 with the classification head removed, so its output is a 64-dim hidden feature. The hidden features from history windows are concatenated with the feature from the current window to form the input of Mamba. Mamba then uses this input to classify the auditory attention direction of the current window. In this figure, $\times$ means multiplication and $\sigma$ means an activation in the Mamba block.
  • Figure 3: Left: the two boxes represent two decision windows, which share an overlapping region when the overlapping ratio is nonzero. Right: the masked region of a decision window is set to zero.
  • Figure 4: All results are for the leave-one-speaker-out setup. The mean, maximum, and minimum accuracies over three runs with different random seeds are shown. Subfigures (a), (b), and (c) show how the ASAD accuracy of SW$_\text{CNN}$ changes with (a) the overlapping ratio $\alpha$, (b) the time masking ratio $\beta$, and (c) the auxiliary loss weighting factor $\gamma$.
  • Figure 5: The results of SWIM, SWIT, and SW$_\text{CNN}$ in the leave-one-speaker-out setup. SWIT refers to the model obtained by replacing Mamba in SWIM with a Transformer. The $x$-axis is the window length the model can use at test time. SWIM achieves the highest accuracy, 86.2%, when the window length is 50 s, while the accuracies of SWIT and SW$_\text{CNN}$ are 85.0% and 84.4%, respectively.
  • ...and 2 more figures
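
The overlapping decision windows and time masking described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the stride between windows is `window_len * (1 - overlap_ratio)`, that the mask is a single contiguous span applied to all channels, and that the mask position is drawn uniformly at random. The function names, the 128-sample window, and the synthetic data are all hypothetical.

```python
import numpy as np


def split_windows(eeg: np.ndarray, window_len: int, overlap_ratio: float) -> np.ndarray:
    """Cut a (channels, samples) recording into overlapping decision windows.

    Returns an array of shape (n_windows, channels, window_len).
    Assumption: consecutive windows are shifted by window_len * (1 - overlap_ratio).
    """
    if not 0.0 <= overlap_ratio < 1.0:
        raise ValueError("overlap_ratio must be in [0, 1)")
    if eeg.shape[1] < window_len:
        raise ValueError("recording is shorter than one window")
    step = max(1, int(round(window_len * (1.0 - overlap_ratio))))
    starts = range(0, eeg.shape[1] - window_len + 1, step)
    return np.stack([eeg[:, s:s + window_len] for s in starts])


def time_mask(window: np.ndarray, mask_ratio: float, rng: np.random.Generator) -> np.ndarray:
    """Zero one contiguous span covering mask_ratio of the window, on all channels.

    Assumption: a single span at a uniformly random position.
    """
    n_samples = window.shape[-1]
    mask_len = int(round(n_samples * mask_ratio))
    masked = window.copy()  # leave the input untouched
    if mask_len > 0:
        start = int(rng.integers(0, n_samples - mask_len + 1))
        masked[..., start:start + mask_len] = 0.0
    return masked


rng = np.random.default_rng(0)
eeg = rng.standard_normal((64, 1280))  # synthetic stand-in: 64 channels, 1280 samples

windows = split_windows(eeg, window_len=128, overlap_ratio=0.5)
augmented = time_mask(windows[0], mask_ratio=0.25, rng=rng)

print(windows.shape)    # (n_windows, channels, window_len)
print(augmented.shape)  # same shape as a single window
```

With an overlap ratio of 0.5 each window shares half of its samples with the next one, so the same recording yields about twice as many training windows as non-overlapping slicing would.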