Exploring Resolution-Wise Shared Attention in Hybrid Mamba-U-Nets for Improved Cross-Corpus Speech Enhancement
Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan
TL;DR
This work targets robust cross-corpus speech enhancement with RWSA-MambaUNet, a hybrid Mamba-attention U-Net that uses resolution-wise shared attention (RWSA), meaning that attention is shared between layers operating at corresponding time and frequency resolutions. The model integrates MambAttention blocks into a U-Net architecture and shares attention across the downsampling and upsampling paths to align global temporal and spectral relations. On two out-of-domain test sets (DNS 2020 and EARS-WHAM_v2), the best-performing model achieves state-of-the-art generalization, and the smallest model surpasses all baselines on several metrics while using less than half the parameters and a fraction of the FLOPs. The approach offers an efficient route to robust speech enhancement across diverse acoustic conditions, and the code is publicly available.
Abstract
Recent advances in speech enhancement have shown that models combining Mamba and attention mechanisms yield superior cross-corpus generalization performance. At the same time, integrating Mamba in a U-Net structure has yielded state-of-the-art enhancement performance while reducing both model size and computational complexity. Inspired by these insights, we propose RWSA-MambaUNet, a novel and efficient hybrid model combining Mamba and multi-head attention in a U-Net structure for improved cross-corpus performance. Resolution-wise shared attention (RWSA) refers to layer-wise attention sharing across corresponding time and frequency resolutions. Our best-performing RWSA-MambaUNet model achieves state-of-the-art generalization performance on two out-of-domain test sets. Notably, our smallest model surpasses all baselines on the out-of-domain DNS 2020 test set in terms of PESQ, SSNR, and ESTOI, and on the out-of-domain EARS-WHAM_v2 test set in terms of SSNR, ESTOI, and SI-SDR, while using less than half the model parameters and a fraction of the FLOPs.
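The idea of sharing attention between layers at the same resolution can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single attention axis (time only), single-head attention, a fixed channel dimension, and no Mamba blocks, and the names `SharedAttention` and `ToySharedAttentionUNet` are hypothetical. It only shows how one attention module per resolution can be reused by both the downsampling and the upsampling path.

```python
import math
import random


def matvec(w, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class SharedAttention:
    """Single-head self-attention over a sequence of feature vectors."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def init():
            return [[rng.gauss(0.0, dim ** -0.5) for _ in range(dim)]
                    for _ in range(dim)]

        self.dim = dim
        self.wq, self.wk, self.wv = init(), init(), init()

    def num_params(self):
        return 3 * self.dim * self.dim

    def __call__(self, x):
        # x is a list of T frames, each a list of `dim` floats.
        q = [matvec(self.wq, v) for v in x]
        k = [matvec(self.wk, v) for v in x]
        val = [matvec(self.wv, v) for v in x]
        scale = math.sqrt(self.dim)
        out = []
        for qi in q:
            w = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
            out.append([sum(wt * vj[d] for wt, vj in zip(w, val))
                        for d in range(self.dim)])
        return out


def downsample(x):
    """Halve the sequence length by averaging adjacent frames."""
    return [[(a + b) / 2 for a, b in zip(x[i], x[i + 1])]
            for i in range(0, len(x) - 1, 2)]


def upsample(x):
    """Double the sequence length by repeating each frame."""
    return [list(v) for v in x for _ in range(2)]


class ToySharedAttentionUNet:
    """Toy U-Net: one attention module per resolution, used on both paths."""

    def __init__(self, dim, levels):
        self.attn = [SharedAttention(dim, seed=k) for k in range(levels)]

    def num_attention_params(self):
        return sum(a.num_params() for a in self.attn)

    def __call__(self, x):
        skips = []
        for attn in self.attn:            # downsampling path: fine to coarse
            x = attn(x)
            skips.append(x)
            x = downsample(x)
        for attn in reversed(self.attn):  # upsampling path: coarse to fine
            x = upsample(x)
            skip = skips.pop()
            x = [[a + b for a, b in zip(u, s)] for u, s in zip(x, skip)]
            x = attn(x)                   # same weights as on the way down
        return x


if __name__ == "__main__":
    model = ToySharedAttentionUNet(dim=4, levels=2)
    frames = [[float(t + d) for d in range(4)] for t in range(8)]
    enhanced = model(frames)
    print(len(enhanced), len(enhanced[0]))   # 8 4
    print(model.num_attention_params())      # 96
```

In this sketch the input length is assumed to be divisible by 2 raised to the number of levels. Because each resolution owns a single attention module, the attention parameter count (96 here) is half of what separate encoder and decoder attention layers would need, which is the kind of saving that sharing is meant to provide.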
