A contrastive-learning approach for auditory attention detection

Seyed Ali Alavi Bajestan, Mark Pitt, Donald S. Williamson

TL;DR

This paper proposes a self-supervised learning method that minimizes the difference between the latent representations of an attended speech signal and the corresponding EEG signal; the resulting network is then fine-tuned for the auditory attention classification task.

Abstract

Carrying on a conversation in a multi-sound environment is challenging because the sounds overlap in time and frequency, making it difficult to understand a single sound source. One proposed approach to isolating an attended speech source is to decode the electroencephalogram (EEG) and identify the attended audio source using statistical or machine learning techniques. However, the limited amount of data compared with other machine learning problems, and the distributional shift between different EEG recordings, emphasize the need for a self-supervised approach that works with limited data to achieve a more robust solution. In this paper, we propose a method based on self-supervised learning to minimize the difference between the latent representations of an attended speech signal and the corresponding EEG signal. This network is further fine-tuned for the auditory attention classification task. We compare our results with previously published methods and achieve state-of-the-art performance on the validation set.
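As a rough illustration of the objective described in the abstract, the sketch below pulls an EEG embedding toward the embedding of the attended speech and away from that of the unattended speech. The exact loss used in the paper is not given in this summary, so a generic triplet margin loss is substituted here; the function names, the `margin` value, and the use of L2-normalized embeddings are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each embedding (last axis) to approximately unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def triplet_contrastive_loss(eeg_z, attended_z, unattended_z, margin=0.5):
    """Generic triplet margin loss (an assumed stand-in, not the paper's loss).

    All inputs have shape (batch, dim). The loss is zero once the EEG
    embedding is closer to the attended-speech embedding than to the
    unattended one by at least `margin` (hypothetical value).
    """
    eeg_z = l2_normalize(eeg_z)
    attended_z = l2_normalize(attended_z)
    unattended_z = l2_normalize(unattended_z)
    d_pos = np.sum((eeg_z - attended_z) ** 2, axis=-1)    # anchor vs. positive
    d_neg = np.sum((eeg_z - unattended_z) ** 2, axis=-1)  # anchor vs. negative
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))
```

In a setup like this, the loss would be minimized over the encoder parameters during the self-supervised stage, before any classification head is fine-tuned.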

Paper Structure

This paper contains 9 sections, 1 equation, 3 figures, 1 table.

Figures (3)

  • Figure 1: A high-level depiction of the proposed network. Gaussian noise is added to the preprocessed audio and EEG data. Each triplet is fed to the two CMAA encoders, which share parameters. The encoder outputs are provided to two sets of probe and classification heads. Each probe head receives a representation from the CMAA encoder; a shallow network was chosen so that the CMAA module would be as descriptive as possible. Each classification head finds a boundary within the latent representation space. Two CLAAD losses and a classification loss are then computed.
  • Figure 2: The cross-attention module takes the preprocessed EEG and audio as input in the first iteration. In subsequent iterations, the EEG input is replaced by the output of the previous stage.
  • Figure 3: The average per-subject accuracy on the validation set (top), where the error bars represent the maximum and minimum accuracy across folds, and the results when the subject data is unseen (bottom).
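The iteration described in the Figure 2 caption can be sketched as follows. This is a minimal sketch rather than the paper's module: it assumes that the EEG stream provides the queries and the audio stream provides the keys and values, that a plain single-head scaled dot-product attention is used, and that each stage has its own projection matrices. The names `cross_attention` and `iterate_cross_attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def cross_attention(query_seq, context_seq, w_q, w_k, w_v):
    """Single-head scaled dot-product cross attention (assumed form).

    query_seq:   (T_q, d) sequence that attends (here: EEG or previous output)
    context_seq: (T_c, d) sequence attended to (here: audio)
    """
    q = query_seq @ w_q
    k = context_seq @ w_k
    v = context_seq @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def iterate_cross_attention(eeg, audio, stage_weights):
    """Apply cross attention repeatedly.

    The first stage sees the preprocessed EEG; every later stage sees the
    previous stage's output in place of the EEG, while the audio is reused.
    """
    x = eeg
    for w_q, w_k, w_v in stage_weights:
        x = cross_attention(x, audio, w_q, w_k, w_v)
    return x

# Usage with random placeholder data (shapes are illustrative only).
rng = np.random.default_rng(0)
d = 8
eeg = rng.normal(size=(10, d))    # 10 EEG frames
audio = rng.normal(size=(12, d))  # 12 audio frames
stage_weights = [tuple(rng.normal(size=(d, d)) for _ in range(3)) for _ in range(3)]
fused = iterate_cross_attention(eeg, audio, stage_weights)
```

Under these assumptions the output keeps the EEG sequence length at every stage, which is what allows it to replace the EEG input in the next iteration.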