CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation

Vahid Ahmadi Kalkhorani; DeLiang Wang

TL;DR

CrossNet presents a complex spectral mapping framework for single- and multi-channel speaker separation in reverberant and noisy environments. It integrates a variance-normalized encoder, a global multi-head self-attention module, cross-band and narrow-band processing, and a novel random chunk positional encoding to generalize across long utterances. The model achieves state-of-the-art results on WSJ0-2mix, WHAMR!, and SMS-WSJ with competitive computational efficiency, and demonstrates strong performance gains in multi-channel setups, including near-oracle ASR performance. The work contributes a scalable, robust architecture that leverages global and local spectral correlations to improve separation and speech enhancement in diverse acoustic scenarios.

Abstract

We introduce CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, a narrow-band module, and an output layer. CrossNet captures global, cross-band, and narrow-band correlations in the time-frequency domain. To address performance degradation in long utterances, we introduce a random chunk positional encoding. Experimental results on multiple datasets demonstrate the effectiveness and robustness of CrossNet, achieving state-of-the-art performance in tasks including reverberant and noisy-reverberant speaker separation. Furthermore, CrossNet exhibits faster and more stable training in comparison to recent baselines. Additionally, CrossNet's high performance extends to multi-microphone conditions, demonstrating its versatility in various acoustic scenarios.
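The abstract names a random chunk positional encoding but does not describe how it works. The sketch below shows one plausible reading of the name, and every detail in it is an assumption rather than the authors' implementation: a sinusoidal table is precomputed for a maximum length, a randomly placed contiguous chunk of it is used during training, and the leading rows are used at inference. The function names, the sinusoidal form, and the table size are hypothetical.

```python
import math
import random


def sinusoidal_table(max_len, dim):
    """Build a standard sinusoidal positional table with max_len rows.

    Assumption: the source does not say which base encoding is used;
    the sinusoidal form is chosen here only for illustration.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(max_len):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000.0 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


def random_chunk_pe(table, num_frames, training, rng=random):
    """Return num_frames consecutive rows of the table.

    Assumed behaviour: a random start offset while training, so that
    short training utterances still visit late positions; offset 0
    at inference.
    """
    max_len = len(table)
    if num_frames > max_len:
        raise ValueError("utterance is longer than the positional table")
    # random.randint is inclusive on both ends.
    start = rng.randint(0, max_len - num_frames) if training else 0
    return table[start:start + num_frames]


if __name__ == "__main__":
    table = sinusoidal_table(max_len=100, dim=8)
    chunk = random_chunk_pe(table, num_frames=20, training=True,
                            rng=random.Random(0))
    print(len(chunk), len(chunk[0]))
```

Under this reading, a model trained on short utterances would still be exposed to positional vectors from anywhere in the table, which is one way such a scheme could help with the long-utterance degradation the abstract mentions.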

Paper Structure (23 sections, 5 equations, 3 figures, 6 tables)

Introduction
Problem statement
CrossNet
Encoder layer
Random chunk positional encoding
Global multi-head self-attention module
Cross-band module
Narrow-band module
Output layer
Loss functions
Experimental Setup
Datasets
Network configuration
Evaluation metrics
Evaluation Results
...and 8 more sections

Figures (3)

Figure 1: Diagram of the proposed CrossNet architecture, with $\hat{s}_1$ and $\hat{s}_2$ denoting separated speaker signals.
Figure 2: CrossNet building blocks. (a) Global multi-head self-attention module. (b) Cross-band module. (c) Narrow-band module.
Figure 3: Effects of sequence length on the performance of CrossNet and SpatialNet. Speaker separation performance is plotted for different intervals of mixture lengths (in seconds).
