Improved Feature Extraction Network for Neuro-Oriented Target Speaker Extraction

Cunhang Fan, Youdian Gao, Zexu Pan, Jingjing Zhang, Hongyu Zhang, Jie Zhang, Zhao Lv

TL;DR

IFENet addresses neuro-oriented target speaker extraction with a time-domain, end-to-end framework that pairs a SpeechBiMamba-based speech encoder with an EEGKAN-based EEG encoder. Speech and EEG features are fused through CMCA, and the model is optimized with a negative SI-SDR loss. On the KUL and AVED datasets it achieves 36% and 29% relative SI-SDR improvements, respectively, over the state-of-the-art model, along with gains on perceptual metrics. The key contributions are SpeechBiMamba for long-range speech modeling and EEGKAN for EEG-driven target localization; together they yield robust performance without prior target speaker information and highlight the importance of EEG features in attention-driven speech extraction. The approach has practical implications for brain-informed hearing aids and other EEG-assisted audio processing applications, with potential for further multimodal extensions.
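
The negative SI-SDR objective mentioned above can be illustrated with a minimal sketch. It is not the authors' training code: it assumes the commonly used SI-SDR definition (project the estimate onto the target, then compare the energy of the projection with the energy of the residual), removes the mean of both signals first, and works on plain Python lists for readability. The names `si_sdr`, `neg_si_sdr_loss`, and `eps` are hypothetical, and the sine and cosine signals are synthetic examples, not data from the paper.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB for two equal-length 1-D signals.

    Assumption: zero-mean normalisation is applied first, as in many
    common implementations; the source does not specify this detail.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Optimal scaling of the target: projection of the estimate onto it.
    alpha = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    scaled_target = [alpha * b for b in t]
    residual = [a - s for a, s in zip(e, scaled_target)]
    signal_energy = sum(s * s for s in scaled_target)
    residual_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_energy + eps) / (residual_energy + eps))


def neg_si_sdr_loss(estimate, target):
    """Loss to minimise: a higher SI-SDR gives a lower loss."""
    return -si_sdr(estimate, target)


# Synthetic example: the same target with light and heavy interference.
target = [math.sin(0.1 * n) for n in range(200)]
interference = [math.cos(0.37 * n) for n in range(200)]
good = [t + 0.1 * v for t, v in zip(target, interference)]
bad = [t + 1.0 * v for t, v in zip(target, interference)]
scaled_good = [2.0 * x for x in good]  # rescaling should not change SI-SDR
```

Because the measure is scale-invariant, `scaled_good` scores the same as `good` up to the small `eps` term, and the estimate with less interference receives the lower loss.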

Abstract

The recent rapid development of auditory attention decoding (AAD) offers the possibility of using electroencephalography (EEG) as auxiliary information for target speaker extraction. However, effectively modeling long sequences of speech and resolving the identity of the target speaker from EEG signals remain major challenges. In this paper, an improved feature extraction network (IFENet) is proposed for neuro-oriented target speaker extraction, which mainly consists of a speech encoder with dual-path Mamba and an EEG encoder with Kolmogorov-Arnold Networks (KAN). We propose SpeechBiMamba, which uses dual-path Mamba to model local and global speech sequences and extract speech features. In addition, we propose EEGKAN to effectively extract EEG features that are closely related to the auditory stimuli and to locate the target speaker through the subject's attention information. Experiments on the KUL and AVED datasets show that IFENet outperforms the state-of-the-art model, achieving 36% and 29% relative improvements, respectively, in terms of scale-invariant signal-to-distortion ratio (SI-SDR) under an open evaluation condition.
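
The "local and global" modeling attributed to the dual-path design can be pictured with a small sketch of the segmentation step that dual-path models commonly use. This is an assumption about the general scheme, not a description of SpeechBiMamba itself: the abstract does not give chunk length, overlap, or padding, so `chunk_len`, `hop`, and the zero padding below are hypothetical choices, and the function names are illustrative only. A sequence model (Mamba in the paper) would be applied along each row of the local view and then along each row of the global view.

```python
import math


def segment(frames, chunk_len, hop):
    """Split a frame sequence into overlapping chunks, zero-padding the tail."""
    if chunk_len <= 0 or hop <= 0 or hop > chunk_len:
        raise ValueError("require 0 < hop <= chunk_len")
    n = len(frames)
    # Number of chunks needed so that every frame is covered.
    num_chunks = 1 if n <= chunk_len else 1 + math.ceil((n - chunk_len) / hop)
    padded_len = chunk_len + (num_chunks - 1) * hop
    padded = list(frames) + [0] * (padded_len - n)
    return [padded[i * hop:i * hop + chunk_len] for i in range(num_chunks)]


def local_view(chunks):
    """Intra-chunk path: each sequence holds neighbouring frames."""
    return [list(chunk) for chunk in chunks]


def global_view(chunks):
    """Inter-chunk path: each sequence holds one position across all chunks."""
    return [list(column) for column in zip(*chunks)]


# Ten frame indices stand in for encoded speech frames.
chunks = segment(list(range(10)), chunk_len=4, hop=2)
local = local_view(chunks)
globl = global_view(chunks)
```

With these settings each local sequence spans four adjacent frames, while each global sequence steps through the utterance two frames at a time, so alternating the two paths lets short sequence models cover a long input.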

Paper Structure

This paper contains 21 sections, 3 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The overall structure of IFENet.
  • Figure 2: (a) Mamba block, (b) EEGKAN layer.