CA-TCN: A Causal-Anticausal Temporal Convolutional Network for Direct Auditory Attention Decoding

Iñigo García-Ugarte, Rubén Eguinoa, Ricardo San Martín, Daniel Paternain, Carmen Vidaurre

Abstract

A promising approach for steering auditory attention in complex listening environments relies on Auditory Attention Decoding (AAD), which aims to identify the attended speech stream in a multi-speaker scenario from neural recordings. Entrainment-based AAD approaches typically assume access to clean speech sources and electroencephalography (EEG) signals, exploiting low-frequency correlations between the neural response and the attended stimulus. In this study, we propose CA-TCN, a Causal-Anticausal Temporal Convolutional Network that directly classifies the attended speaker. The proposed architecture integrates several best practices from convolutional neural networks for sequence processing tasks. Importantly, it explicitly aligns auditory stimuli and neural responses by employing separate causal and anticausal convolutions with distinct receptive fields operating in opposite temporal directions. Experimental results, obtained through comparisons with three baseline AAD models, demonstrated that CA-TCN consistently improved decoding accuracy across datasets and decision windows, with gains ranging from 0.5% to 3.2% for subject-independent models and from 0.8% to 2.9% for subject-specific models compared with the next best-performing model, AADNet. Moreover, these improvements were statistically significant in four of the six evaluated settings when comparing Minimum Expected Switch Duration (MESD) distributions. Beyond accuracy, the model demonstrated spatial robustness across different conditions, as the EEG spatial filters exhibited stable patterns across datasets. Overall, this work introduces an accurate and unified AAD model that outperforms existing methods while offering practical benefits for online processing scenarios. These findings contribute to advancing the state of AAD and its applicability in real-world systems.
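
To make the causal/anticausal idea concrete, the sketch below shows how one-sided zero-padding turns a dilated 1-D convolution causal (stimulus branch, padding on the left) or anticausal (EEG branch, padding on the right), mirroring Figure 2. This is a minimal illustration assuming PyTorch; the channel count, kernel size, and the names DirectionalDilatedConv, audio_branch, and eeg_branch are hypothetical and do not come from the paper.

```python
# Minimal sketch of directional dilated convolutions (assumes PyTorch;
# sizes and names are illustrative, not the authors' configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectionalDilatedConv(nn.Module):
    """1-D dilated convolution made causal (pad left) or anticausal (pad right)."""
    def __init__(self, channels, kernel_size=3, dilation=1, causal=True):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # one-sided zero-padding amount
        self.causal = causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        # Causal: output at time t sees only samples <= t (stimulus branch).
        # Anticausal: output at time t sees only samples >= t (EEG branch).
        x = F.pad(x, (self.pad, 0) if self.causal else (0, self.pad))
        return self.conv(x)

# Two independent branches with exponential dilations d = [1, 2, 4] (Figure 2).
audio_branch = nn.Sequential(*[DirectionalDilatedConv(16, dilation=d, causal=True)  for d in (1, 2, 4)])
eeg_branch   = nn.Sequential(*[DirectionalDilatedConv(16, dilation=d, causal=False) for d in (1, 2, 4)])

audio = torch.randn(8, 16, 128)  # toy (batch, channels, time) tensors
eeg   = torch.randn(8, 16, 128)
print(audio_branch(audio).shape, eeg_branch(eeg).shape)  # both keep length 128
```

Because each layer pads only on one side by exactly the amount the convolution consumes, the sequence length is preserved while the receptive field of each decision point grows in one temporal direction only.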

Paper Structure

This paper contains 33 sections, 5 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: (a) CA-TCN block diagram consisting of a Spatial Projection layer, a Temporal Convolutional Network (TCN) module, and a Classification module. EEG and audio signals are processed separately through two independent branches until the classification module. (b) Design of the one-dimensional convolutional block within the TCN.
  • Figure 2: Illustration of dilated convolutions in a three-layer TCN ($N=3$) with exponential dilation factors $d=[1,2,4]$. The yellow sample (decision point) depends on the blue samples that form its receptive field. Convolutions are applied causally for the stimulus (b) and anticausally for the EEG (a), with zero-padding on the left or right, respectively (a worked receptive-field example follows this list).
  • Figure 3: Subject-independent (SI) distribution of MESD across the three datasets considered in this study. Each point corresponds to the MESD value obtained for a given subject, and median values are highlighted in the distributions. The number of subjects whose MESD exceeds 80 s is indicated at the top of the distributions in parentheses. Statistical analysis: ***: $p < 0.001$, **: $0.001\leq p<0.01$, *: $0.01\leq p<0.05$, None: $p\geq0.05$.
  • Figure 4: Subject-specific (SS) distribution of MESD across the three datasets considered in this study. Each point corresponds to the average MESD obtained for a given subject, and median values are highlighted in the distributions.
  • Figure 5: Topographic maps of the first cluster obtained by grouping the Spatial Projection filters from the fine-tuned subject-specific CA-TCN across the Jaulab (a), DTU (b), and KULeuven (c) datasets. Values below the topographic maps indicate the cosine similarity between each pair of clusters.
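
For reference, the receptive field $R$ of the dilated stack illustrated in Figure 2 follows the standard TCN relation $R = 1 + (k-1)\sum_{i=1}^{N} d_i$, where $k$ is the kernel size. The excerpt does not state $k$, so taking $k=3$ purely for illustration gives $R = 1 + 2\cdot(1+2+4) = 15$ samples per decision point.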