Binaural Selective Attention Model for Target Speaker Extraction

Hanyu Meng, Qiquan Zhang, Xiangyu Zhang, Vidhyasaharan Sethu, Eliathamby Ambikairajah

TL;DR

This work tackles Target Speaker Extraction in binaural, multi-speaker environments by modeling binaural selective hearing with a FaSNet-based, time-domain separator. It introduces two binaural interaction strategies (Cosine Similarity on time-domain frames and Inter-Channel Attention Correlation on learned spectral features) and implements them as the Bi-CSim-TSE and Bi-IAC-TSE models, both guided by a multi-head attention-based speaker embedding. The approach is evaluated on LibriSpeech data convolved with Surrey HRTFs, achieving best-in-class results of SI-SDR = 18.52 dB, SDR = 19.12 dB, and PESQ = 3.05 in anechoic two-speaker tests and outperforming monaural baselines and prior multichannel methods. The reported results support the effectiveness of time-domain binaural beamforming and favor cosine-based binaural interaction for preserving spatial cues, with the multi-head attention-based embedding providing robust guidance for target extraction.
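To make the first interaction strategy concrete, the sketch below computes a frame-wise cosine similarity between the left and right time-domain channels. It is only an illustration of the general idea, not the paper's feature extractor: the function names (`frame_signal`, `framewise_cosine_similarity`), the frame length, and the hop size are all assumptions chosen for the example.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames of shape (n_frames, frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def framewise_cosine_similarity(left, right, frame_len=64, hop=32, eps=1e-8):
    """Cosine similarity between time-aligned left/right frames.

    frame_len and hop are illustrative values, not taken from the paper.
    Returns one value in [-1, 1] per frame.
    """
    left_frames = frame_signal(left, frame_len, hop)
    right_frames = frame_signal(right, frame_len, hop)
    num = np.sum(left_frames * right_frames, axis=1)
    den = (np.linalg.norm(left_frames, axis=1)
           * np.linalg.norm(right_frames, axis=1) + eps)
    return num / den

# Toy usage: identical channels give similarity close to +1 in every frame.
rng = np.random.default_rng(0)
left = rng.standard_normal(256)
sim = framewise_cosine_similarity(left, left)
```

A value near +1 means the two ears receive nearly the same waveform in that frame, while lower values reflect level or phase differences between the channels, which is the kind of spatial cue a binaural model can exploit.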

Abstract

The remarkable ability of humans to selectively focus on a target speaker in cocktail party scenarios is facilitated by binaural audio processing. In this paper, we present a binaural time-domain Target Speaker Extraction model based on the Filter-and-Sum Network (FaSNet). Inspired by human selective hearing, the proposed model injects a target speaker embedding into the separators through a multi-head attention-based selective attention block. We also compare two binaural interaction approaches: the cosine similarity of time-domain signals and inter-channel correlation in learned spectral representations. Experimental results show that the proposed model outperforms monaural configurations and state-of-the-art multi-channel target speaker extraction models, achieving best-in-class performance with 18.52 dB SI-SDR, 19.12 dB SDR, and a PESQ score of 3.05 under anechoic two-speaker test configurations.
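For readers unfamiliar with the headline metric, the following is a minimal sketch of scale-invariant signal-to-distortion ratio (SI-SDR) using its standard definition: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. The function name `si_sdr` and the `eps` stabilizer are assumptions for this example; the paper's evaluation code is not shown in the source.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between a 1-D estimate and a 1-D reference."""
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Optimal scaling of the reference that best matches the estimate.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    residual = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)

# Toy usage: a lightly corrupted copy of the reference scores higher
# than a heavily corrupted one.
rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
score_clean = si_sdr(reference + 0.1 * noise, reference)
score_noisy = si_sdr(reference + 1.0 * noise, reference)
```

Because the reference is rescaled before the comparison, multiplying the estimate by a constant gain leaves the score unchanged, which is why SI-SDR is commonly reported alongside plain SDR for time-domain separation models.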

Paper Structure

This paper contains 18 sections, 9 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: An illustration of binaural selective attention
  • Figure 2: An overview of the proposed binaural target speaker extraction model
  • Figure 3: The structure of the speaker extractor that produces the target speaker embedding
  • Figure 4: The monaural configuration