Exploring Resolution-Wise Shared Attention in Hybrid Mamba-U-Nets for Improved Cross-Corpus Speech Enhancement

Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan

TL;DR

This work addresses cross-corpus robustness in speech enhancement with a hybrid Mamba-attention U-Net that uses resolution-wise shared attention (RWSA), meaning attention modules are shared layerwise across corresponding time and frequency resolutions. RWSA-MambaUNet places MambAttention blocks in a U-Net architecture and shares their attention modules between the downsampling and upsampling paths to align global temporal and spectral relations. On two out-of-domain test sets (DNS 2020 and EARS-WHAM_v2), the best-performing model achieves state-of-the-art generalization, and the smallest model surpasses all baselines on several metrics while using less than half the parameters and a fraction of the FLOPs. The approach offers a practical, efficient route to robust speech enhancement across diverse acoustic conditions, and the code is publicly available.

Abstract

Recent advances in speech enhancement have shown that models combining Mamba and attention mechanisms yield superior cross-corpus generalization performance. At the same time, integrating Mamba in a U-Net structure has yielded state-of-the-art enhancement performance, while reducing both model size and computational complexity. Inspired by these insights, we propose RWSA-MambaUNet, a novel and efficient hybrid model combining Mamba and multi-head attention in a U-Net structure for improved cross-corpus performance. Resolution-wise shared attention (RWSA) refers to layerwise attention-sharing across corresponding time- and frequency resolutions. Our best-performing RWSA-MambaUNet model achieves state-of-the-art generalization performance on two out-of-domain test sets. Notably, our smallest model surpasses all baselines on the out-of-domain DNS 2020 test set in terms of PESQ, SSNR, and ESTOI, and on the out-of-domain EARS-WHAM_v2 test set in terms of SSNR, ESTOI, and SI-SDR, while using less than half the model parameters and a fraction of the FLOPs.

Paper Structure

This paper contains 13 sections, 2 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Overall structure of our proposed RWSA-MambaUNet. Resolution-wise shared attention (purple dashed lines) is the layerwise sharing of MHA modules within MambAttention blocks across corresponding resolutions between the downsampling and upsampling paths. To simplify the figure, we have not depicted the residual connections between the output of the feature encoder and the outputs of both refinement layers.
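
The sharing pattern described in the Figure 1 caption can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. `TinySelfAttention` is a hypothetical single-head stand-in for an MHA module, and `build_paths`, `count_unique_params`, `num_levels`, and `dim` are names invented for this example. The sketch only shows the idea that one attention module per resolution level is reused on both the downsampling and the upsampling path, so its parameters are counted once.

```python
import math
import random


class TinySelfAttention:
    """Hypothetical single-head self-attention, standing in for an MHA module."""

    def __init__(self, dim, rng):
        self.dim = dim
        # Three dim x dim projection matrices: query, key, value.
        self.wq, self.wk, self.wv = (
            [[rng.gauss(0.0, dim ** -0.5) for _ in range(dim)] for _ in range(dim)]
            for _ in range(3)
        )

    def num_params(self):
        return 3 * self.dim * self.dim

    @staticmethod
    def _project(x, w):
        # x is a list of vectors; returns the matrix product x @ w.
        return [
            [sum(v[i] * w[i][j] for i in range(len(v))) for j in range(len(w[0]))]
            for v in x
        ]

    def __call__(self, x):
        q, k, v = (self._project(x, w) for w in (self.wq, self.wk, self.wv))
        out = []
        for qt in q:
            # Scaled dot-product scores against every key, then a stable softmax.
            scores = [sum(a * b for a, b in zip(qt, kt)) / math.sqrt(self.dim) for kt in k]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            weights = [e / total for e in exps]
            out.append([sum(w * vt[j] for w, vt in zip(weights, v)) for j in range(self.dim)])
        return out


def build_paths(num_levels, dim, seed=0):
    """Create one attention module per resolution level and reuse it on both paths."""
    rng = random.Random(seed)
    shared = [TinySelfAttention(dim, rng) for _ in range(num_levels)]
    down_path = list(shared)           # highest resolution first
    up_path = list(reversed(shared))   # lowest resolution first
    return down_path, up_path


def count_unique_params(modules):
    # Count each module object once, however many times it is referenced.
    unique = {id(m): m for m in modules}
    return sum(m.num_params() for m in unique.values())


down_path, up_path = build_paths(num_levels=3, dim=4)
x = [[0.1 * (t + c) for c in range(4)] for t in range(5)]  # 5 steps, 4 features

# The first downsampling level and the last upsampling level work at the same
# resolution, so they reference the very same module object.
assert down_path[0] is up_path[-1]
assert down_path[0](x) == up_path[-1](x)

shared_total = count_unique_params(down_path + up_path)
unshared_total = sum(m.num_params() for m in down_path + up_path)
print(shared_total, unshared_total)
```

In this toy setting, sharing halves the attention parameter count relative to giving each path its own modules. The savings reported in the paper depend on the full architecture and are not reproduced by this sketch.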