U-Mamba-Net: A highly efficient Mamba-based U-net style network for noisy and reverberant speech separation

Shaoxiang Dang, Tetsuya Matsumoto, Yoshinori Takeuchi, Hiroaki Kudo

TL;DR

The paper tackles efficient speech separation in noisy and reverberant environments by proposing U-Mamba-Net, a lightweight U-Net-based architecture that interleaves U-Net blocks with a Mamba selective state-space module to capture long-range dependencies at linear cost. Through HiPPO-initialized state-space processing and input-dependent filtering, the network achieves strong SI-SNRi improvements and substantially lower GMACs than baselines such as DPRNN in a CMTL setting. While its perceptual and denoising metrics are competitive, CMTL-based approaches retain an advantage in certain perceptual dimensions, highlighting a trade-off between efficiency and some aspects of perceptual quality. Overall, U-Mamba-Net demonstrates a practical, resource-efficient solution for speech separation in complex environments, with potential for real-time deployment on constrained hardware.
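
SI-SNRi, the headline metric above, is the gain in scale-invariant signal-to-noise ratio of a separated estimate over the unprocessed mixture. The following is a minimal NumPy sketch of that standard definition, not the authors' evaluation code; the function names si_snr and si_snr_improvement are illustrative.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals of equal length."""
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target; the projection is the "signal" part.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target
    # Everything orthogonal to the target counts as "noise".
    e_noise = estimate - s_target
    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return 10.0 * np.log10(ratio)


def si_snr_improvement(estimate, target, mixture):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score unchanged, which is why the metric suits separation models whose output gain is arbitrary.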

Abstract

Speech separation splits mixed speech containing multiple overlapping speakers into several streams, each containing speech from only one speaker. Many highly effective models have emerged and proliferated rapidly over time. However, the size and computational load of these models have increased accordingly. This is a serious burden for the community, as researchers need more time and computational resources to reproduce and compare existing models. In this paper, we propose U-Mamba-Net: a lightweight Mamba-based U-style model for speech separation in complex environments. Mamba is a state space sequence model that incorporates feature selection capabilities. A U-style network is a fully convolutional neural network whose symmetric contracting and expansive paths are able to learn multi-resolution features. In our work, Mamba serves as a feature filter, alternating with U-Net. We test the proposed model on Libri2Mix. The results show that U-Mamba-Net achieves improved performance at low computational cost.
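
To make the "feature selection" idea concrete, here is a minimal NumPy sketch of a selective state-space recurrence in the style of Mamba. It is an illustration under stated assumptions, not the U-Mamba-Net implementation. The function name selective_scan and the projection matrices W_B, W_C and W_delta are hypothetical, and a real Mamba block adds gating, a short convolution and a hardware-aware parallel scan.

```python
import numpy as np


def selective_scan(x, A, W_B, W_C, W_delta):
    """Sequential selective state-space scan (illustrative only).

    x:       (T, D) input sequence
    A:       (D, N) diagonal state matrix per channel, negative for stability
    W_B:     (D, N) projection producing the input-dependent B_t
    W_C:     (D, N) projection producing the input-dependent C_t
    W_delta: (D, D) projection producing the input-dependent step size
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))          # hidden state, one N-dim state per channel
    y = np.zeros((T, D))
    for t in range(T):
        # Step size, B and C all depend on the current input: this is the
        # "selection" that lets the model keep or forget information.
        delta = np.log1p(np.exp(x[t] @ W_delta))      # softplus, shape (D,)
        B_t = x[t] @ W_B                              # shape (N,)
        C_t = x[t] @ W_C                              # shape (N,)
        A_bar = np.exp(delta[:, None] * A)            # zero-order hold for A
        B_bar = delta[:, None] * B_t[None, :]         # simplified Euler step for B
        h = A_bar * h + B_bar * x[t][:, None]         # state update
        y[t] = h @ C_t                                # readout, shape (D,)
    return y
```

The loop visits each time step once, so the cost grows linearly with sequence length, in contrast to the quadratic cost of full self-attention.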

Paper Structure

This paper contains 15 sections, 5 equations, 2 figures, 5 tables, 1 algorithm.

Figures (2)

  • Figure 1: Overview of U-Mamba-Net.
  • Figure 2: Spectrograms of separation results. The single spectrogram in the first row is the noisy and reverberant mixture. The two spectrograms in the second row are the ground truths. The third row shows the spectrograms of the two sources separated by the DPRNN model, and the last row shows the estimates of U-Mamba-Net. The red boxes highlight regions where DPRNN separates incorrectly but U-Mamba-Net does not. The white box outlines a region where U-Mamba-Net performs worse, because its fundamental frequencies and harmonics are not as clear as those of DPRNN.