Universal Speech Enhancement with Regression and Generative Mamba

Rong Chao, Rauf Nasretdinov, Yu-Chiang Frank Wang, Ante Jukić, Szu-Wei Fu, Yu Tsao

TL;DR

Universal Speech Enhancement Mamba (USEMamba) targets robust, universal speech enhancement (SE) across seven distortion types and five languages. It builds on a state-space (Mamba) backbone with time-frequency processing and sampling-rate-independent features. Regression-based magnitude mapping handles most distortions, while a generative flow-based variant (USEMamba-Flow) handles cases where missing content must be generated, such as bandwidth extension and packet loss; a simple energy-based merger balances the two outputs. Architectural changes include deeper TF-Mamba blocks, a mapping-based loss design, and flow-based training, which together support strong generalization while using less memory than Transformer-based models. In the URGENT 2025 Challenge, USEMamba placed 2nd in Track 1 of the blind test phase, and combining the regression and generative outputs gave competitive performance.

Abstract

The Interspeech 2025 URGENT Challenge aimed to advance universal, robust, and generalizable speech enhancement by unifying speech enhancement tasks across a wide variety of conditions, including seven different distortion types and five languages. We present Universal Speech Enhancement Mamba (USEMamba), a state-space speech enhancement model designed to handle long-range sequence modeling, time-frequency structured processing, and sampling frequency-independent feature extraction. Our approach primarily relies on regression-based modeling, which performs well across most distortions. However, for packet loss and bandwidth extension, where missing content must be inferred, a generative variant of the proposed USEMamba proves more effective. Despite being trained on only a subset of the full training data, USEMamba achieved 2nd place in Track 1 during the blind test phase, demonstrating strong generalization across diverse conditions.

Paper Structure

This paper contains 14 sections, 4 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of the Universal SEMamba architecture.
  • Figure 2: GPU VRAM requirement for 16kHz audio in training.
  • Figure 3: Spectrogram comparison of (a) distortion input (noisy, bandwidth limitation, and packet loss) and enhanced speech from (b) regression model (USEMamba), (c) generative model (USEMamba-Flow), and (d) proposed combined method.