Linear stimulus reconstruction works on the KU Leuven audiovisual, gaze-controlled auditory attention decoding dataset

Simon Geirnaert, Iustina Rotaru, Tom Francart, Alexander Bertrand

TL;DR

This work evaluates linear stimulus reconstruction as a baseline for auditory attention decoding (AAD) on the AV-GC-AAD dataset, a dataset designed to expose gaze-related confounds in spatial AAD (Sp-AAD) approaches. The attended speech envelope is reconstructed from EEG via a spatio-temporal backward model. The decoder $\mathbf{d}$ is trained by minimizing $\|\mathbf{s}_a - \mathbf{X} \mathbf{d}\|_2^2$, where $\mathbf{s}_a$ is the attended envelope and $\mathbf{X}$ contains the time-lagged EEG, and is regularized using Ledoit-Wolf shrinkage. The attended speaker is then identified by comparing the Pearson correlations between the reconstructed envelope and each of the competing envelopes. Results show significant AAD accuracy within each condition and strong generalization across conditions, across new subjects, and even across datasets, supporting the claim that the AV-GC-AAD data are decodable with simple linear models. The authors provide a reproducible baseline procedure and code to benchmark future AAD algorithms on this challenging dataset.
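
To make the pipeline concrete, here is a minimal sketch on synthetic data, not the authors' released code. The function names (`lag_matrix`, `train_decoder`, `decode_attention`), the toy signal model, and all parameter values are assumptions made for illustration. In particular, the sketch uses a fixed shrinkage weight toward a scaled identity matrix as a stand-in for the Ledoit-Wolf estimate, which would choose that weight from the data.

```python
import numpy as np

def lag_matrix(eeg, n_lags):
    """Stack time-lagged EEG copies: row t holds eeg[t], ..., eeg[t + n_lags - 1].

    eeg has shape (time, channels); rows near the end are zero-padded.
    """
    n_samples, n_channels = eeg.shape
    X = np.zeros((n_samples, n_channels * n_lags))
    for lag in range(n_lags):
        X[: n_samples - lag, lag * n_channels:(lag + 1) * n_channels] = eeg[lag:]
    return X

def train_decoder(X, s_att, shrinkage=0.1):
    """Regularized least-squares solution of min_d ||s_att - X d||^2.

    ASSUMPTION: `shrinkage` is a fixed weight; Ledoit-Wolf would estimate it.
    """
    n_samples, n_features = X.shape
    Rxx = X.T @ X / n_samples                      # EEG autocorrelation matrix
    target = np.trace(Rxx) / n_features * np.eye(n_features)
    Rxx_reg = (1.0 - shrinkage) * Rxx + shrinkage * target
    rxs = X.T @ s_att / n_samples                  # EEG-envelope cross-correlation
    return np.linalg.solve(Rxx_reg, rxs)

def decode_attention(X, d, env_0, env_1):
    """Return the index (0 or 1) of the envelope best correlated with X @ d."""
    s_hat = X @ d
    r0 = np.corrcoef(s_hat, env_0)[0, 1]
    r1 = np.corrcoef(s_hat, env_1)[0, 1]
    return 0 if r0 > r1 else 1

# Toy data (hypothetical): the EEG carries a delayed copy of the attended envelope.
rng = np.random.default_rng(0)
n_samples, n_channels, n_lags, delay = 3000, 8, 5, 2
s_att = rng.standard_normal(n_samples)
s_unatt = rng.standard_normal(n_samples)
gains = rng.standard_normal(n_channels)
eeg = np.outer(np.roll(s_att, delay), gains)
eeg += rng.standard_normal((n_samples, n_channels))

X = lag_matrix(eeg, n_lags)
train, test = slice(0, 2000), slice(2000, 3000)
d = train_decoder(X[train], s_att[train])
decision = decode_attention(X[test], d, s_att[test], s_unatt[test])
```

The lags point forward in time because the neural response follows the stimulus, so the envelope at time $t$ is reconstructed from EEG samples at $t$ and later. On real data, the envelopes and EEG would first be filtered and downsampled, and the decision would be made per window (for example, 60 s) rather than on one long test segment.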

Abstract

In a recent paper, we presented the KU Leuven audiovisual, gaze-controlled auditory attention decoding (AV-GC-AAD) dataset, in which we recorded electroencephalography (EEG) signals of participants attending to one out of two competing speakers under various audiovisual conditions. The main goal of this dataset was to disentangle the direction of gaze from the direction of auditory attention, in order to reveal gaze-related shortcuts in existing spatial AAD algorithms that aim to decode the (direction of) auditory attention directly from the EEG. Various methods based on spatial AAD do not achieve significant above-chance performances on our AV-GC-AAD dataset, indicating that previously reported results were mainly driven by eye gaze confounds in existing datasets. Still, these adverse outcomes are often discarded for reasons that are attributed to the limitations of the AV-GC-AAD dataset, such as the limited amount of data to train a working model, too much data heterogeneity due to different audiovisual conditions, or participants allegedly being unable to focus their auditory attention under the complex instructions. In this paper, we present the results of the linear stimulus reconstruction AAD algorithm and show that high AAD accuracy can be obtained within each individual condition and that the model generalizes across conditions, across new subjects, and even across datasets. Therefore, we eliminate any doubts that the inadequacy of the AV-GC-AAD dataset is the primary reason for the (spatial) AAD algorithms failing to achieve above-chance performance when compared to other datasets. Furthermore, this report provides a simple baseline evaluation procedure (including source code) that can serve as the minimal benchmark for all future AAD algorithms evaluated on this dataset.

Paper Structure

This paper contains 15 sections, 9 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: An overview of the linear stimulus decoding algorithm for AAD, in which the attended speech envelope is reconstructed from the neural responses and correlated with the presented speech envelopes to identify the attended one through the Pearson correlation coefficient. Based on Figure 3a in geirnaert2021eegBased and Figure 2 in geirnaert2024fast.
  • Figure 2: Using leave-one-trial-out CV for subject-specific decoding per individual condition leads to significant AAD performances for every single condition, even when the visual instruction is incongruent with the direction of auditory attention (moving video and moving target + noise). Gray lines are replicas per condition, provided as a reference.
  • Figure 3: (a) Leave-one-trial-out and leave-one-condition-out CV across all conditions at once both lead to significant and expected AAD performance. (b) A breakdown of the leave-one-condition-out CV accuracies per condition shows that generalization to every other condition is possible and similar. Each dot represents the average AAD accuracy on $60\,\mathrm{s}$ decision windows for one subject.
  • Figure 4: Using leave-one-subject-out CV and generalizing from the KU Leuven AAD dataset 2016 (i.e., subject-independent decoding) leads to significant AAD performances, showing that generalization across subjects and datasets is possible.
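
The cross-validation schemes named in the captions above (leave-one-trial-out, leave-one-condition-out, leave-one-subject-out) differ only in which label defines a held-out group. The sketch below shows one way to generate such folds. The helper `leave_one_group_out` and the condition labels are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each unique group."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        test_idx = np.flatnonzero(groups == g)
        train_idx = np.flatnonzero(groups != g)
        yield g, train_idx, test_idx

# Hypothetical example: six trials recorded under three conditions.
conditions = ["cond_1", "cond_1", "cond_2", "cond_2", "cond_3", "cond_3"]
folds = list(leave_one_group_out(conditions))

# Leave-one-trial-out is the special case where every trial is its own group.
trial_folds = list(leave_one_group_out(range(len(conditions))))
```

Passing subject identifiers as `groups` gives leave-one-subject-out folds in the same way. For the cross-dataset case, no folds are needed: the decoder is trained on one dataset and tested on the other.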