MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement

Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan

TL;DR

The paper tackles the generalization gap in single-channel speech enhancement by introducing MambAttention, a hybrid architecture that combines Mamba blocks with time- and frequency-multi-head attention modules that share weights, in order to better capture temporal and spectral structure. It introduces VB-DemandEx, a more challenging benchmark than VoiceBank+Demand, and demonstrates that MambAttention outperforms discriminative baselines and is competitive with diffusion models and language-model-based approaches on two out-of-domain datasets (DNS 2020 without reverberation and EARS-WHAM_v2), while maintaining strong in-domain performance. Key contributions include ablation evidence that weight sharing and the attention-before-Mamba ordering enhance robustness, and the finding that augmenting LSTM/xLSTM with multi-head attention (MHA) improves generalization, although these hybrids still lag behind the proposed model. The work shows scalability benefits on large datasets and discusses practical considerations for computation and potential real-time extensions, offering a step toward robust, generalizable speech enhancement in real-world environments.

Abstract

With new sequence models like Mamba and xLSTM, several studies have shown that these models match or outperform the state-of-the-art in single-channel speech enhancement and audio representation learning. However, prior research has demonstrated that sequence models like LSTM and Mamba tend to overfit to the training set. To address this, previous works have shown that adding self-attention to LSTMs substantially improves generalization performance for single-channel speech enhancement. Nevertheless, neither the concept of hybrid Mamba and time-frequency attention models nor their generalization performance has been explored for speech enhancement. In this paper, we propose a novel hybrid architecture, MambAttention, which combines Mamba and shared time- and frequency-multi-head attention modules for generalizable single-channel speech enhancement. To train our model, we introduce VB-DemandEx, a dataset inspired by VoiceBank+Demand but with more challenging noise types and lower signal-to-noise ratios. Trained on VB-DemandEx, MambAttention significantly outperforms existing state-of-the-art discriminative LSTM-, xLSTM-, Mamba-, and Conformer-based systems of similar complexity across all reported metrics on two out-of-domain datasets: DNS 2020 without reverberation and EARS-WHAM_v2. MambAttention also matches or outperforms generative diffusion models in generalization performance while being competitive with language model baselines. Ablation studies highlight the importance of weight sharing between time- and frequency-multi-head attention modules for generalization performance. Finally, we explore integrating the shared time- and frequency-multi-head attention modules with LSTM and xLSTM, which yields a notable performance improvement on the out-of-domain datasets. Yet, MambAttention remains superior for cross-corpus generalization across all reported evaluation metrics.
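
To make the idea of weight sharing between the time and frequency attention modules concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a single-head attention layer (the paper uses multi-head attention), omits the batch dimension, the Mamba blocks, normalization, and residual connections, and writes `F` for the number of frequency bins. The names `SelfAttention`, `attend_along_time`, and `attend_along_frequency` are hypothetical. The only point illustrated is that one set of attention weights is applied first across time frames and then across frequency bins.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


class SelfAttention:
    """Single-head self-attention over a sequence of K-dimensional vectors.

    Simplifying assumption: one head and randomly initialized weights.
    """

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def init():
            return [[rng.gauss(0.0, dim ** -0.5) for _ in range(dim)] for _ in range(dim)]

        self.wq, self.wk, self.wv = init(), init(), init()

    def __call__(self, seq):
        # seq has shape (L, K); the output has the same shape.
        q, k, v = (matmul(seq, w) for w in (self.wq, self.wk, self.wv))
        scale = math.sqrt(len(self.wq))
        weights = [
            softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
            for qi in q
        ]
        return matmul(weights, v)


def attend_along_time(x, attn):
    """x[k][t][f]: for each frequency bin, attend across the T time frames."""
    K, T, F = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * F for _ in range(T)] for _ in range(K)]
    for f in range(F):
        seq = [[x[k][t][f] for k in range(K)] for t in range(T)]  # (T, K)
        res = attn(seq)
        for t in range(T):
            for k in range(K):
                out[k][t][f] = res[t][k]
    return out


def attend_along_frequency(x, attn):
    """x[k][t][f]: for each time frame, attend across the F frequency bins."""
    K, T, F = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * F for _ in range(T)] for _ in range(K)]
    for t in range(T):
        seq = [[x[k][t][f] for k in range(K)] for f in range(F)]  # (F, K)
        res = attn(seq)
        for f in range(F):
            for k in range(K):
                out[k][t][f] = res[f][k]
    return out


# Toy feature map with K channels, T time frames, and F frequency bins.
K, T, F = 4, 6, 5
rng = random.Random(1)
x = [[[rng.uniform(-1.0, 1.0) for _ in range(F)] for _ in range(T)] for _ in range(K)]

# Weight sharing: the same attention object serves both axes.
shared = SelfAttention(dim=K)
y = attend_along_frequency(attend_along_time(x, shared), shared)
print(len(y), len(y[0]), len(y[0][0]))
```

Because both calls receive the same `shared` object, the time and frequency passes use identical projection matrices. An unshared variant would simply construct two separate `SelfAttention` instances, which is the configuration the ablation in the abstract compares against.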

Paper Structure

This paper contains 23 sections, 16 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overall structure of our proposed MambAttention model. $M$, $K$, $T$, and $F'$ represent the batch size, the number of channels, the number of time frames, and the number of frequency bins, respectively.
  • Figure 2: Spectrogram visualizations of the noisy speech, clean speech, and enhanced speech from our proposed MambAttention and the Conformer, LSTM, xLSTM, and Mamba baselines.
  • Figure 3: t-SNE visualizations of the VB-DemandEx, DNS 2020 without reverb, and EARS-WHAM_v2 test sets.
  • Figure 4: t-SNE visualizations of the VB-DemandEx, DNS 2020 without reverb, and EARS-WHAM_v2 test sets along with their clean references.