U-Mamba-Net: A highly efficient Mamba-based U-net style network for noisy and reverberant speech separation

Shaoxiang Dang; Tetsuya Matsumoto; Yoshinori Takeuchi; Hiroaki Kudo

TL;DR

The paper tackles efficient speech separation in noisy and reverberant environments by proposing U-Mamba-Net, a lightweight U-Net-style architecture that alternates U-Net blocks with Mamba selective state-space modules to capture long-range dependencies at a cost linear in sequence length. Through HiPPO-initialized state-space processing and input-dependent filtering, the network achieves strong SI-SNRi and substantially lower GMACs than baselines such as DPRNN in a CMTL setting. Its perceptual and denoising metrics are competitive, but CMTL-based approaches still hold an advantage in certain perceptual dimensions, which points to a trade-off between efficiency and some aspects of perceptual quality. Overall, U-Mamba-Net is a practical, resource-efficient approach to speech separation in complex environments, with potential for real-time deployment on resource-constrained hardware.
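Since the summary above reports results as SI-SNRi, a minimal sketch of that metric may help. It uses the standard definition of scale-invariant SNR (project the zero-mean estimate onto the zero-mean target, then compare the projected energy with the residual energy) and takes the improvement as the gain over the unprocessed mixture. The function names `si_snr` and `si_snr_improvement` and the `eps` stabilizer are illustrative assumptions, not the authors' evaluation code.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length signals."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)
    # Remove the mean so the metric ignores DC offsets.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [v - mean_e for v in estimate]
    t = [v - mean_t for v in target]
    # Project the estimate onto the target to isolate the target component.
    scale = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    projection = [scale * b for b in t]
    residual = [a - p for a, p in zip(e, projection)]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(r * r for r in residual) + eps
    return 10.0 * math.log10(signal_energy / noise_energy + eps)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the separated estimate over the raw mixture, in dB."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Toy signals (hypothetical): a target tone plus an interfering tone.
target = [math.sin(0.1 * i) for i in range(200)]
interferer = [math.cos(0.37 * i) for i in range(200)]
mixture = [s + v for s, v in zip(target, interferer)]
estimate = [s + 0.1 * v for s, v in zip(target, interferer)]
improvement = si_snr_improvement(estimate, mixture, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant.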

Abstract

Speech separation splits a mixture of multiple overlapping speakers into several streams, each containing the speech of only one speaker. Many highly effective models have emerged and proliferated rapidly over time. However, the size and computational load of these models have grown accordingly. This is a serious burden for the community, as researchers need more time and computational resources to reproduce and compare existing models. In this paper, we propose U-Mamba-Net: a lightweight Mamba-based U-style model for speech separation in complex environments. Mamba is a state space sequence model that incorporates feature selection capabilities. A U-style network is a fully convolutional neural network whose symmetric contracting and expansive paths learn multi-resolution features. In our work, Mamba serves as a feature filter, alternating with U-Net. We test the proposed model on Libri2Mix. The results show that U-Mamba-Net achieves improved performance at quite low computational cost.
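To illustrate what "a state space sequence model that incorporates feature selection" can mean in practice, here is a toy single-channel sketch of a selective state-space recurrence. It is not the authors' implementation or the Mamba library API: the function `selective_scan`, the scalar weights `w_delta`, `w_b`, `w_c`, and the diagonal values in `a` are all hypothetical choices made for illustration. The step size and the input and output projections depend on the current input, which is what lets the recurrence filter features, and the loop visits each time step once, so the cost grows linearly with sequence length.

```python
import math


def softplus(z):
    """Numerically stable softplus, used to keep the step size positive."""
    return math.log1p(math.exp(-abs(z))) + max(z, 0.0)


def selective_scan(x, a, w_delta, w_b, w_c):
    """Toy selective SSM over one channel (illustrative assumption only).

    x: input sequence (list of floats)
    a: negative diagonal state-matrix entries, one per state dimension
    w_delta: scalar weight producing the input-dependent step size
    w_b, w_c: per-state weights producing input-dependent B_t and C_t
    """
    n_state = len(a)
    h = [0.0] * n_state
    y = []
    for x_t in x:
        delta = softplus(w_delta * x_t)          # input-dependent step size
        for i in range(n_state):
            decay = math.exp(delta * a[i])       # discretized state transition
            b_t = w_b[i] * x_t                   # input-dependent input projection
            h[i] = decay * h[i] + delta * b_t * x_t
        # Input-dependent output projection C_t applied to the hidden state.
        y.append(sum(w_c[i] * x_t * h[i] for i in range(n_state)))
    return y


# Hypothetical parameters: four state dimensions with stable (negative) dynamics.
a = [-1.0, -2.0, -3.0, -4.0]
w_b = [0.5, 0.4, 0.3, 0.2]
w_c = [1.0, 0.5, 0.25, 0.125]
signal = [math.sin(0.2 * i) for i in range(64)]
output = selective_scan(signal, a, w_delta=1.0, w_b=w_b, w_c=w_c)
```

A practical model would run many such channels in parallel with learned projections and a hardware-efficient scan; this loop only shows the recurrence itself.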
