NeuroSpex: Neuro-Guided Speaker Extraction with Cross-Modal Attention

Dashanka De Silva, Siqi Cai, Saurav Pahuja, Tanja Schultz, Haizhou Li

TL;DR

NeuroSpex is an end-to-end neuro-guided speaker extraction framework that uses the listener's EEG as the sole auxiliary cue to isolate attended speech from a monaural mixture. It fuses speech and EEG features through cross-modal attention and encodes the EEG with a deep stack of AdC blocks (multi-head attention combined with depth-wise convolutions), so no pre-enrolled reference speech is required. In ablations and baseline comparisons on the public KUL dataset, NeuroSpex improves SI-SDR, SI-SDRi, PESQ, and STOI over prior neuro-steered methods while keeping the parameter count reasonable. The work shows the value of combining temporal EEG dynamics with spatial-spectral speech representations in cocktail party scenarios, and points toward subject-independent and speaker-specific extensions.
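
The headline metrics above, SI-SDR and SI-SDRi, can be illustrated with a minimal sketch. It uses the standard scale-invariant SDR definition rather than the authors' evaluation code; the function names `si_sdr` and `si_sdr_improvement`, the mean removal step, and the small `eps` guard are assumptions made for this illustration.

```python
import math


def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB (standard definition).

    Both arguments are equal-length sequences of samples. `eps` is an
    illustrative guard against division by zero, not part of the definition.
    """
    n = len(target)
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    # Remove the mean so a constant offset does not affect the score.
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]

    # Project the estimate onto the target to find the optimally scaled target.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt)
    scale = dot / (tgt_energy + eps)
    projection = [scale * t for t in tgt]

    # Everything in the estimate that is not explained by the target.
    residual = [e - p for e, p in zip(est, projection)]

    signal_energy = sum(p * p for p in projection)
    residual_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: gain of the extracted signal over the unprocessed mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)
```

A positive SI-SDRi means the extracted signal is closer to the attended speech than the unprocessed mixture is. Rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.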

Abstract

Studies of auditory attention have revealed a robust correlation between attended speech and the neural responses it elicits, measurable through electroencephalography (EEG). It is therefore possible to use the attention information in EEG signals to computationally guide the extraction of the target speaker in a cocktail party scenario. In this paper, we present a neuro-guided speaker extraction model, NeuroSpex, which uses the listener's EEG response as the sole auxiliary reference cue to extract attended speech from monaural speech mixtures. We propose a novel EEG signal encoder that captures the attention information. Additionally, we propose a cross-attention (CA) mechanism that enhances the speech feature representations used to generate a speaker extraction mask. Experimental results on a publicly available dataset demonstrate that our proposed model outperforms two baseline models across various evaluation metrics.

Paper Structure

This paper contains 18 sections, 9 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Block diagram of NeuroSpex showing its main components and connections. NeuroSpex has $m$ cascaded blocks of cross-attention (CA) and TCN in the Speaker Extractor, and $n$ cascaded AdC blocks in the EEG encoder. NeuroSpex takes a speech mixture $x$ and an EEG signal $y$ as input and generates the target speech $s$. Here, $X$ and $S$ represent utterance-level embeddings for the speech mixture and the target speech, respectively. $Y$ represents the reference signal, where $Y_{0}$ and $Y^{\prime}$ denote the output of the pre-convolution (preConv) layer and the interpolated reference signal, respectively. $M$ represents the generated mask. $\oplus$ and $\otimes$ denote the residual connection and element-wise multiplication, respectively.
  • Figure 2: (a) CA block to fuse EEG and speech mixture embeddings in the speaker extractor. (b) AdC block from the EEG encoder including multi-head attention and depth-wise convolutions.
  • Figure 3: Violin plots of the SI-SDR improvement of the extracted speech for each subject in the test set. Extraction quality is consistent across subjects.
  • Figure 4: Violin plots of the SI-SDR improvement of the extracted speech on the test set for NeuroSpex and 4 baseline models. Here, $\ast$ marks a statistically significant difference (p < 0.001, paired t-test).
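
The data flow described in the captions for Figures 1 and 2(a) can be sketched roughly as follows: the EEG reference is interpolated to the speech frame rate ($Y^{\prime}$), fused with the mixture embedding $X$ through cross-attention with a residual connection, and turned into a mask $M$ that is applied element-wise to $X$. This is a simplified illustration, not the authors' implementation. It assumes speech frames act as queries and EEG frames as keys and values, that the two streams share a feature dimension, and that the mask comes from a sigmoid. It also omits the learned projections, the TCN layers, and the $m$-fold cascade. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate_frames(frames, length):
    """Linearly resample a sequence of feature vectors to `length` frames (Y -> Y')."""
    if len(frames) == 1 or length == 1:
        return [list(frames[0]) for _ in range(length)]
    resampled = []
    for t in range(length):
        pos = t * (len(frames) - 1) / (length - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(frames) - 1)
        frac = pos - lo
        resampled.append(
            [(1 - frac) * a + frac * b for a, b in zip(frames[lo], frames[hi])]
        )
    return resampled


def cross_attention(speech, reference):
    """Scaled dot-product attention without learned projections (a simplification).

    Assumption: speech frames are the queries, reference frames are both
    keys and values. A residual connection adds the query back.
    """
    dim = len(speech[0])
    fused = []
    for query in speech:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in reference
        ]
        weights = softmax(scores)
        attended = [
            sum(w * value[j] for w, value in zip(weights, reference))
            for j in range(dim)
        ]
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


def extract(speech, eeg):
    """One illustrative pass: interpolate, fuse, build mask M, return M * X."""
    reference = interpolate_frames(eeg, len(speech))
    fused = cross_attention(speech, reference)
    # Assumption: a sigmoid bounds the mask in (0, 1); the source does not
    # state which activation produces M.
    mask = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in fused]
    return [
        [x * m for x, m in zip(x_row, m_row)]
        for x_row, m_row in zip(speech, mask)
    ]
```

Because the assumed mask lies in (0, 1), each masked value has a smaller magnitude than the corresponding mixture value. In the architecture described by Figure 1, a decoder would then map the masked embedding $S$ back to the waveform $s$.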