Dual-path Mamba: Short and Long-term Bidirectional Selective Structured State Space Models for Speech Separation

Xilin Jiang, Cong Han, Nima Mesgarani

TL;DR

Efficient long-sequence speech separation requires modeling dependencies without quadratic complexity. The paper introduces Dual-path Mamba (DPMamba), which combines selective state-space modeling (Mamba) with a dual-path, time-domain architecture and bidirectional processing to capture local and global context. Key contributions include formulating a selective SSM with input-adaptive dynamics, a time-domain dual-path network with BiMamba blocks achieving SI-SNRi up to $22.6$ dB and SDRi up to $22.7$ dB on WSJ0-2mix with relatively small parameter counts, and comprehensive ablations highlighting the value of backward processing and dynamic mixing. The results demonstrate that linear-complexity Mamba can match or surpass CNN/RNN/Transformer baselines with lower memory, enabling more efficient deployment for speech separation.
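The SI-SNRi figure quoted above is an improvement: the scale-invariant signal-to-noise ratio of the separated estimate minus that of the unprocessed mixture, both measured against the clean target. The following is a minimal pure-Python sketch of that standard definition, not the authors' evaluation code; the function names `si_snr` and `si_snr_improvement` and the `eps` stabilizer are illustrative assumptions.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length lists of samples."""
    n = len(target)
    # Remove the mean so a DC offset does not affect the score.
    e_mean = sum(estimate) / n
    t_mean = sum(target) / n
    e = [x - e_mean for x in estimate]
    t = [x - t_mean for x in target]
    # Project the estimate onto the target to get the "signal" part.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps  # eps is an assumed stabilizer
    s_target = [(dot / t_energy) * b for b in t]
    # Everything left over counts as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]
    signal_power = sum(s * s for s in s_target)
    noise_power = sum(x * x for x in e_noise) + eps
    return 10.0 * math.log10(signal_power / noise_power + eps)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the estimate over the raw mixture, in dB."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric suits separation models whose output gain is arbitrary.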

Abstract

Transformers have been the most successful architecture for various speech modeling tasks, including speech separation. However, the self-attention mechanism in transformers with quadratic complexity is inefficient in computation and memory. Recent models incorporate new layers and modules along with transformers for better performance but also introduce extra model complexity. In this work, we replace transformers with Mamba, a selective state space model, for speech separation. We propose dual-path Mamba, which models short-term and long-term forward and backward dependency of speech signals using selective state spaces. Our experimental results on the WSJ0-2mix data show that our dual-path Mamba models of comparably smaller sizes outperform state-of-the-art RNN model DPRNN, CNN model WaveSplit, and transformer model Sepformer. Code: https://github.com/xi-j/Mamba-TasNet
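To make the idea of short-term and long-term, forward and backward modeling concrete, here is a toy sketch in pure Python. It is an illustration under stated assumptions, not the architecture from the paper or the linked repository: the sequence is a list of scalars, the chunks are non-overlapping, the two directions are summed, residual connections are added, and every constant (`a`, `w_delta`, `w_b`, `w_c`) is a made-up placeholder for parameters that a real model would learn.

```python
import math


def selective_scan(x, a=(-1.0, -0.5), w_delta=1.0, w_b=(1.0, 0.5), w_c=(0.5, 1.0)):
    """Toy selective SSM: step size, input map, and output map depend on x_t."""
    h = [0.0] * len(a)  # hidden state, one entry per (diagonal) state dimension
    y = []
    for x_t in x:
        # Softplus keeps the input-dependent step size positive.
        delta = math.log1p(math.exp(w_delta * x_t))
        for n in range(len(a)):
            decay = math.exp(delta * a[n])   # discretized state transition
            b_t = w_b[n] * x_t               # input-dependent B (assumed scalar map)
            h[n] = decay * h[n] + delta * b_t * x_t
        # Input-dependent C reads the state out.
        y.append(sum((w_c[n] * x_t) * h[n] for n in range(len(a))))
    return y


def bidirectional_scan(x):
    """Sum a forward scan and a scan over the reversed sequence (an assumption)."""
    fwd = selective_scan(x)
    bwd = selective_scan(x[::-1])[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


def dual_path(x, chunk_size):
    """Intra-chunk (short-term) then inter-chunk (long-term) bidirectional scans."""
    if len(x) % chunk_size != 0:
        raise ValueError("toy sketch: length must be a multiple of chunk_size")
    chunks = [list(x[i:i + chunk_size]) for i in range(0, len(x), chunk_size)]
    # Short-term path: scan inside each chunk, with a residual connection.
    chunks = [[v + r for v, r in zip(c, bidirectional_scan(c))] for c in chunks]
    # Long-term path: for each position, scan across all chunks.
    for k in range(chunk_size):
        column = [c[k] for c in chunks]
        for c, o in zip(chunks, bidirectional_scan(column)):
            c[k] += o
    # Merge the chunks back into one sequence.
    return [v for c in chunks for v in c]
```

With a forward-only scan, the first output cannot depend on later samples. In this sketch the backward scans and the inter-chunk path together let the last sample influence the first output, which is the motivation for combining both directions and both paths.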

Paper Structure

This paper contains 16 sections, 9 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: A top-down view of DPMamba from I to IV.
  • Figure 2: A comparison of GPU memory usage of DPMamba with Sepformer and DPRNN.