Leveraging Joint Spectral and Spatial Learning with MAMBA for Multichannel Speech Enhancement
Wenze Ren, Haibin Wu, Yi-Cheng Lin, Xuanjun Chen, Rong Chao, Kuo-Hsuan Hung, You-Jin Li, Wen-Yuan Ting, Hsin-Min Wang, Yu Tsao
TL;DR
The paper tackles multichannel speech enhancement (SE) by jointly exploiting spatial cues across microphones and spectral content across frequency bands. It introduces MCMamba, an architecture built on the Mamba state-space model (SSM), with Uni- and Bi-directional variants that support causal (online) and non-causal (offline) processing, respectively. MCMamba consists of four dedicated modules (full-band spatial, narrow-band spatial, sub-band spectral, and full-band spectral), each using selective SSM blocks to capture long-range dependencies. Results on the CHiME-3 dataset show state-of-the-art performance, and ablation studies indicate that Mamba is particularly effective at spectral modeling for multichannel SE.
Abstract
In multichannel speech enhancement, effectively capturing spatial and spectral information across microphones is crucial for noise reduction. Conventional approaches based on CNNs or LSTMs attempt to model the temporal dynamics of full-band and sub-band spectral and spatial features. However, these approaches struggle to fully model complex temporal dependencies, especially in dynamic acoustic environments. To overcome these challenges, we build on McNet, a recent advanced model, by introducing an improved version of Mamba, a state-space model, and propose MCMamba. MCMamba has been completely reengineered to integrate full-band and narrow-band spatial information with sub-band and full-band spectral features, providing a more comprehensive approach to modeling spatial and spectral information. Our experimental results demonstrate that MCMamba significantly improves the modeling of spatial and spectral features in multichannel speech enhancement, outperforming McNet and achieving state-of-the-art performance on the CHiME-3 dataset. Additionally, we find that Mamba performs exceptionally well in modeling spectral information.
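As a rough illustration of the causal versus non-causal distinction behind the Uni- and Bi-directional variants, the sketch below runs a fixed-parameter diagonal state-space recurrence over time, once forward only and once in both directions. It is not the paper's implementation: Mamba's selective mechanism makes the SSM parameters input-dependent, which is omitted here, and the function names (`ssm_scan`, `bi_ssm_scan`), the parameters `a`, `b`, `c`, and the choice to sum the two directions are all assumptions made for illustration.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Causal diagonal linear state-space recurrence over time.

    x: (T, D) input sequence; a, b, c: (D,) per-channel parameters
    (hypothetical, fixed; a real selective SSM would compute them from x).
    Recurrence: h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t
    """
    h = np.zeros(x.shape[1])
    y = np.empty_like(x, dtype=float)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]   # state update uses only past and current frames
        y[t] = c * h           # readout
    return y

def bi_ssm_scan(x, a, b, c):
    """Non-causal variant: forward scan plus a scan over the time-reversed input.

    Summing the two directions is an assumption of this sketch.
    """
    fwd = ssm_scan(x, a, b, c)
    bwd = ssm_scan(x[::-1], a, b, c)[::-1]  # reverse, scan, reverse back
    return fwd + bwd
```

In the forward-only scan, changing a future frame leaves all earlier outputs untouched, which is what makes online processing possible; in the bidirectional scan every output depends on the whole sequence, so it suits offline use only.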
