Scattering Transform for Auditory Attention Decoding

René Pallenberg, Fabrice Katzberg, Alfred Mertins, Marco Maass

TL;DR

The two-layer scattering transform can significantly improve performance for subject-related conditions, especially on the KUL dataset; on the DTU dataset, the improvement holds only for some models or when more training data is available. This suggests that the scattering transform is capable of extracting additional relevant information.

Abstract

The use of hearing aids will increase in the coming years due to demographic change. One open problem that remains to be solved by a new generation of hearing aids is the cocktail party problem. A possible solution is electroencephalography-based auditory attention decoding. This has been the subject of several studies in recent years, most of which rely on the same preprocessing methods. In this work, a scattering transform is proposed as an alternative to these preprocessing methods. The two-layer scattering transform is compared with a regular filterbank, the synchrosqueezing short-time Fourier transform, and the common preprocessing. To demonstrate the performance, the known and the proposed preprocessing methods are compared for different classification tasks on two widely used datasets, provided by KU Leuven (KUL) and the Technical University of Denmark (DTU). Both established and newer neural-network-based models are used for classification, including CNNs, LSTMs, and recent Transformer- and graph-based models. Various evaluation strategies were compared, with a focus on classifying speakers that were not seen during training. We show that the two-layer scattering transform can significantly improve the performance for subject-related conditions, especially on the KUL dataset. However, on the DTU dataset, this only applies to some of the models, or when larger amounts of training data are provided, as in 10-fold cross-validation. This suggests that the scattering transform is capable of extracting additional relevant information.
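
The two-layer scattering transform named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' pipeline: the Gaussian filters, the helper names (`gaussian_bandpass`, `constant_q_bank`, `scattering_two_layer`), the parameters (`num_octaves`, `q1`, `q2`, `lowpass_bw`), and the toy signals are all assumptions made for illustration. Here `q1` and `q2` are assumed to mean filters per octave, and no subsampling is applied.

```python
import numpy as np


def gaussian_bandpass(n, center, bandwidth):
    """Analytic Gaussian band-pass filter, defined in the frequency domain."""
    freqs = np.fft.fftfreq(n)  # normalized frequency in cycles per sample
    return np.exp(-0.5 * ((freqs - center) / bandwidth) ** 2) * (freqs > 0)


def constant_q_bank(n, num_octaves, q):
    """q filters per octave; centers descend from 0.4 cycles per sample."""
    centers = 0.4 * 2.0 ** (-np.arange(num_octaves * q) / q)
    rel_bw = 2.0 ** (1.0 / q) - 1.0  # bandwidth proportional to center frequency
    return centers, [gaussian_bandpass(n, c, c * rel_bw) for c in centers]


def scattering_two_layer(x, num_octaves=6, q1=8, q2=1, lowpass_bw=0.01):
    """Return (order 0, order 1, order 2) coefficients (hypothetical helper)."""
    n = len(x)
    phi = np.exp(-0.5 * (np.fft.fftfreq(n) / lowpass_bw) ** 2)  # low-pass filter

    def smooth(u):
        # Circular convolution with the low-pass filter via the FFT
        return np.real(np.fft.ifft(np.fft.fft(u) * phi))

    centers1, bank1 = constant_q_bank(n, num_octaves, q1)
    centers2, bank2 = constant_q_bank(n, num_octaves, q2)
    x_hat = np.fft.fft(x)
    s1, s2 = [], []
    for f1, psi1 in zip(centers1, bank1):
        u1 = np.abs(np.fft.ifft(x_hat * psi1))  # layer-1 envelope (modulus)
        s1.append(smooth(u1))
        u1_hat = np.fft.fft(u1)
        for f2, psi2 in zip(centers2, bank2):
            if f2 < f1:  # an envelope varies more slowly than its carrier band
                s2.append(smooth(np.abs(np.fft.ifft(u1_hat * psi2))))
    return smooth(x), np.array(s1), np.array(s2)


# Toy signals (not EEG or audio data): a pure tone and an amplitude-modulated tone
t = np.arange(1000)
tone = np.cos(2 * np.pi * 0.1 * t)
am_tone = (1.0 + 0.5 * np.cos(2 * np.pi * 0.025 * t)) * tone

_, s1_tone, s2_tone = scattering_two_layer(tone)
_, s1_am, s2_am = scattering_two_layer(am_tone)
```

In this toy setting, the pure tone should yield layer-2 coefficients that are numerically close to zero, because its layer-1 envelopes are constant, while the amplitude-modulated tone should yield clearly nonzero layer-2 coefficients. This is the kind of modulation information that a second layer adds on top of a single filterbank stage.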

Paper Structure (42 sections, 3 equations, 9 figures, 11 tables)

Figures (9)

  • Figure 1: Scattering transform applied to audio (top) and EEG at T8 (bottom) from the KUL dataset. Layer 2 captures fine-grained temporal modulations invisible in Layer 1, shown by temporal striations (0.0-0.5 s) and differentiated patterns during acoustically similar events (3.5-4.5 s). Original signal and ST layers 1-2, with $F_o=8$, $Q=8$.
  • Figure 2: Architecture of the models used. The input has $T$ time steps, $C_\mathrm{eeg}$ channels for the EEG, and $C_\mathrm{aud}$ channels for the audio signals. The inputs are on the left, and the output is on the right. LSTM-2 can only work with two speakers, while the LSTM-X can handle a variable number of speakers.
  • Figure 3: Kernel density plots comparing baseline (base), ST (scat), and SSQ-STFT preprocessing on DTU and KUL datasets. Accuracy distributions for trial-wise $5\times2$ cv across all subjects with $L_x=2\,\mathrm{s}$ (CNN-C1, CNN-Dil) and $L_x=1\,\mathrm{s}$ (others). ST parameters: $F_o=8\,\mathrm{Hz}$, $Q_a=Q_e=8$.
  • Figure 4: Kernel density plots comparing baseline (base), ST (scat), and SSQ-STFT preprocessing for speaker-wise evaluation on the KUL dataset. Accuracy distributions for trial-wise $5\times2$ cv across all subjects with $L_x=2\,\mathrm{s}$ (CNN-C1, CNN-Dil) and $L_x=1\,\mathrm{s}$ (others). ST parameters: $F_o=8\,\mathrm{Hz}$, $Q_a=Q_e=8$.
  • Figure 5: Collection of kernel density plots comparing the results of the scattering pipeline for different $F_o$ with the baseline pipeline for the speaker-wise evaluation on the KUL dataset, with $L_x=2\,\mathrm{s}$ for the LSTM-2. Each plot shows the accuracy distribution of the different runs over all subjects. $Q_a=Q_e=8$ for the scattering pipeline.
  • ...and 4 more figures