An Investigation of Incorporating Mamba for Speech Enhancement

Rong Chao, Wen-Huang Cheng, Moreno La Quatra, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Szu-Wei Fu, Yu Tsao

TL;DR

This work investigates SEMamba, a speech enhancement model built on Mamba, an attention-free state-space model. It compares two core SEMamba configurations (basic causal and advanced non-causal) against Transformer baselines, and introduces enhancements such as bi-directional Mamba, consistency loss, and perceptual contrast stretching (PCS). Key results include a PESQ of 3.55 with advanced non-causal SEMamba and a new state-of-the-art PESQ of 3.69 when combined with PCS, along with reduced FLOPs (up to ~12% relative to Transformer-based equivalents), reduced parameter counts, and competitive performance as an ASR pre-processing step. The findings suggest that Mamba-based SE can deliver competitive or superior quality at lower computational cost, with practical impact for downstream ASR systems and long-sequence processing tasks.

Abstract

This work investigates the use of Mamba, a recently proposed, attention-free, scalable state-space model (SSM), for the speech enhancement (SE) task. In particular, we employ Mamba to build regression-based SE models (SEMamba) in several configurations, namely basic, advanced, causal, and non-causal. We also consider loss functions that are either based on signal-level distances or metric-oriented. Experimental evidence shows that SEMamba attains a competitive PESQ of 3.55 on the VoiceBank-DEMAND dataset with the advanced, non-causal configuration. A new state-of-the-art PESQ of 3.69 is also reported when SEMamba is combined with Perceptual Contrast Stretching (PCS). Compared against equivalent Transformer-based SE solutions, a FLOPs reduction of up to ~12% is observed with the advanced non-causal configuration. Finally, SEMamba can be used as a pre-processing step before automatic speech recognition (ASR), showing competitive performance against recent SE solutions.
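To make the causal versus non-causal distinction concrete, the following is a minimal sketch of a linear state-space recurrence run in one direction and in both directions. It is not the SEMamba implementation: the function names `ssm_scan` and `bidirectional_ssm`, the fixed diagonal parameters `a`, `b`, `c`, and the choice to sum the two directions are all assumptions made for illustration. In particular, Mamba makes its SSM parameters input-dependent (selective), which this sketch omits.

```python
import numpy as np


def ssm_scan(x, a, b, c):
    """Causal diagonal SSM: h[t] = a * h[t-1] + b * x[t], y[t] = c . h[t].

    Hypothetical helper with fixed parameters (not input-dependent as in Mamba).
    """
    h = np.zeros_like(a, dtype=float)
    y = np.empty(len(x), dtype=float)
    for t, x_t in enumerate(x):
        h = a * h + b * x_t          # state update uses only past and current input
        y[t] = float(np.dot(c, h))   # read-out from the hidden state
    return y


def bidirectional_ssm(x, a, b, c):
    """Non-causal variant: scan forward and backward, then sum (assumed fusion)."""
    x = np.asarray(x, dtype=float)
    forward = ssm_scan(x, a, b, c)
    backward = ssm_scan(x[::-1], a, b, c)[::-1]  # scan the time-reversed input
    return forward + backward


# Impulse at the last frame: the causal output at earlier frames stays zero,
# while the bidirectional output at earlier frames already reflects it.
a, b, c = np.array([0.5]), np.array([1.0]), np.array([1.0])
x = np.array([0.0, 0.0, 0.0, 1.0])
causal_out = ssm_scan(x, a, b, c)
noncausal_out = bidirectional_ssm(x, a, b, c)
```

With an impulse at the final frame, the causal scan leaves all earlier outputs at zero, whereas the bidirectional version lets that future frame influence earlier ones. This is why a non-causal configuration can use more context but cannot run strictly frame by frame.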

Paper Structure (14 sections, 1 equation, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Architecture of our basic Mamba-based Speech Enhancement (SE) model, SEMamba-basic.
  • Figure 2: Architecture of the proposed SEMamba-advanced with Time-Frequency (TF) and Selective-SSM mechanism.
  • Figure 3: Comparative analysis of WERs for SEMamba and related models on the VoiceBank-DEMAND dataset with Whisper ASR (Radford et al., 2023).