CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation

Vahid Ahmadi Kalkhorani, DeLiang Wang

TL;DR

CrossNet presents a complex spectral mapping framework for single- and multi-channel speaker separation in reverberant and noisy environments. It integrates a variance-normalized encoder, a global multi-head self-attention module, cross-band and narrow-band processing, and a novel random chunk positional encoding to generalize across long utterances. The model achieves state-of-the-art results on WSJ0-2mix, WHAMR!, and SMS-WSJ with competitive computational efficiency, and demonstrates strong performance gains in multi-channel setups, including near-oracle ASR performance. The work contributes a scalable, robust architecture that leverages global and local spectral correlations to improve separation and speech enhancement in diverse acoustic scenarios.

Abstract

We introduce CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, a narrow-band module, and an output layer. CrossNet captures global, cross-band, and narrow-band correlations in the time-frequency domain. To address performance degradation in long utterances, we introduce a random chunk positional encoding. Experimental results on multiple datasets demonstrate the effectiveness and robustness of CrossNet, achieving state-of-the-art performance in tasks including reverberant and noisy-reverberant speaker separation. Furthermore, CrossNet exhibits faster and more stable training in comparison to recent baselines. Additionally, CrossNet's high performance extends to multi-microphone conditions, demonstrating its versatility in various acoustic scenarios.
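The abstract names a random chunk positional encoding but does not spell out its mechanics. The sketch below shows one plausible reading, stated as an assumption rather than the authors' implementation: a positional table longer than any training utterance is precomputed, a contiguous chunk of it is drawn at a random offset during training, and the first rows are used at inference. The sinusoidal table, the function names `sinusoidal_table` and `random_chunk_pe`, and all sizes are hypothetical choices made for illustration.

```python
import math
import random


def sinusoidal_table(max_len, dim):
    """Build a standard sinusoidal positional table as a list of max_len rows.

    Assumption: the abstract does not say which base encoding is used;
    sinusoids are chosen here only for illustration.
    """
    table = []
    for pos in range(max_len):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            # Even feature indices use sine, odd indices use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def random_chunk_pe(table, num_frames, training, rng=random):
    """Return num_frames consecutive rows of the positional table.

    Training: the chunk starts at a random offset, so the model is exposed
    to positions beyond the length of any single training utterance.
    Inference: the chunk always starts at position 0.
    (Hypothetical helper, not taken from the paper.)
    """
    max_len = len(table)
    if num_frames > max_len:
        raise ValueError("utterance is longer than the positional table")
    start = rng.randint(0, max_len - num_frames) if training else 0
    return table[start:start + num_frames]


# Usage: a 50-position table, a 10-frame utterance.
table = sinusoidal_table(max_len=50, dim=4)
train_pe = random_chunk_pe(table, 10, training=True, rng=random.Random(0))
test_pe = random_chunk_pe(table, 10, training=False)
```

Under this reading, a model trained on short utterances has already seen the positional values it will meet on longer test utterances, which is the generalization effect the abstract attributes to the encoding.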

Paper Structure

This paper contains 23 sections, 5 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Diagram of the proposed CrossNet architecture, with $\hat{s}_1$ and $\hat{s}_2$ denoting separated speaker signals.
  • Figure 2: CrossNet building blocks. (a) Global multi-head self-attention module. (b) Cross-band module. (c) Narrow-band module.
  • Figure 3: Effects of sequence length on the performance of CrossNet and SpatialNet. Speaker separation performance is plotted for different intervals of mixture lengths (in seconds).