CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation
Vahid Ahmadi Kalkhorani, DeLiang Wang
TL;DR
CrossNet is a complex spectral mapping framework for single- and multi-channel speaker separation in reverberant and noisy environments. It integrates a variance-normalized encoder, a global multi-head self-attention module, cross-band and narrow-band processing, and a novel random chunk positional encoding that helps it generalize to long utterances. The model achieves state-of-the-art results on WSJ0-2mix, WHAMR!, and SMS-WSJ with competitive computational efficiency, and shows strong gains in multi-channel setups, including near-oracle ASR performance. The resulting architecture is robust and exploits both global and local spectral correlations to improve separation and speech enhancement across diverse acoustic scenarios.
Abstract
We introduce CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, a narrow-band module, and an output layer. CrossNet captures global, cross-band, and narrow-band correlations in the time-frequency domain. To address performance degradation on long utterances, we introduce a random chunk positional encoding. Experimental results on multiple datasets demonstrate the effectiveness and robustness of CrossNet, which achieves state-of-the-art performance on tasks including reverberant and noisy-reverberant speaker separation. CrossNet also trains faster and more stably than recent baselines, and its high performance extends to multi-microphone conditions, demonstrating its versatility across acoustic scenarios.
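The abstract names a random chunk positional encoding but does not spell out its mechanics. The sketch below illustrates one plausible reading, stated here as an assumption: a positional table much longer than any training utterance is precomputed, a contiguous chunk starting at a random offset is used during training, and the chunk starting at offset zero is used at inference. The sinusoidal table and the names `sinusoidal_table` and `random_chunk_pe` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math
import random


def sinusoidal_table(max_len, dim):
    """Build a sinusoidal positional table with max_len rows of size dim.

    Assumption: a standard sine/cosine table stands in for whatever
    positional table the actual model uses.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(max_len):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000.0 ** (i / dim))
            row.extend((math.sin(angle), math.cos(angle)))
        table.append(row)
    return table


def random_chunk_pe(table, num_frames, training, rng=None):
    """Return num_frames consecutive rows of the positional table.

    Training: the chunk starts at a random offset, so the model sees
    positions from across the whole table even on short utterances.
    Inference: the chunk starts at offset 0 (assumed behaviour).
    """
    max_len = len(table)
    if num_frames > max_len:
        raise ValueError("utterance is longer than the positional table")
    if training:
        rng = rng or random.Random()
        # randint is inclusive on both ends.
        start = rng.randint(0, max_len - num_frames)
    else:
        start = 0
    return table[start:start + num_frames]


# Example: a 1000-position table, a 100-frame training utterance.
table = sinusoidal_table(max_len=1000, dim=8)
train_chunk = random_chunk_pe(table, 100, training=True, rng=random.Random(0))
eval_chunk = random_chunk_pe(table, 100, training=False)
```

In a full model the selected chunk would be combined with the encoder features along the time axis; that step is omitted here because the abstract does not describe it.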
