SepMamba: State-space models for speaker separation using Mamba

Thor Højhus Avenstrup, Boldizsár Elek, István László Mádi, András Bence Schin, Morten Mørup, Bjørn Sand Jensen, Kenny Falkær Olsen

TL;DR

This work proposes SepMamba, a U-Net-based architecture composed of bidirectional Mamba layers that outperforms similarly-sized prominent models, including transformer-based models, on the WSJ0 2-speaker dataset while enjoying significant computational benefits in terms of multiply-accumulates, peak memory usage, and wall-clock time.

Abstract

Deep learning-based single-channel speaker separation has improved significantly in recent years largely due to the introduction of the transformer-based attention mechanism. However, these improvements come at the expense of intense computational demands, precluding their use in many practical applications. As a computationally efficient alternative with similar modeling capabilities, Mamba was recently introduced. We propose SepMamba, a U-Net-based architecture composed primarily of bidirectional Mamba layers. We find that our approach outperforms similarly-sized prominent models - including transformer-based models - on the WSJ0 2-speaker dataset while enjoying a significant reduction in computational cost, memory usage, and forward pass time. We additionally report strong results for causal variants of SepMamba. Our approach provides a computationally favorable alternative to transformer-based architectures for deep speech separation.

Paper Structure

This paper contains 7 sections, 2 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: SepMamba has 5 stages of processing with Bamba stacks. Downsampling and upsampling are handled by convolutional layers and matching transposed convolutional layers, respectively. The skip connections are projected into the required dimension with $1 \times 1$ convolutions. We double the dimension of the Mamba blocks after each downsampling and halve it after each upsampling, so that dimensions match on the same level.
  • Figure 2: (Left) Average forward pass time on an NVIDIA A100 GPU for 4-second audio samples at 8 kHz. (Middle) Peak GPU memory usage during backpropagation for a 4-second sample at 8 kHz on an NVIDIA A100 GPU. (Right) Multiply-accumulate (MAC) operations per second. *For MossFormer2, SI-SDRi is listed instead of SI-SNRi.
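
The dimension bookkeeping described in the Figure 1 caption can be illustrated with a small sketch. It is not the authors' implementation: it only tracks the channel dimension and sequence length per stage. The names `Stage` and `stage_plan`, the base dimension of 64, the reading of "5 stages" as two encoder levels, one bottleneck, and two decoder levels, and a temporal downsampling factor of 2 are all assumptions made for illustration. The input length of 32000 corresponds to the 4-second, 8 kHz samples mentioned in Figure 2.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    name: str
    channels: int  # feature dimension of the Mamba blocks at this level
    length: int    # number of time steps at this level


def stage_plan(base_channels: int, length: int, num_down: int = 2) -> List[Stage]:
    """Hypothetical helper: list (channels, length) for each U-Net stage.

    Assumptions (not confirmed by the source): every downsampling halves
    the sequence length, and num_down=2 yields the 5 stages of Figure 1.
    """
    if length % (2 ** num_down) != 0:
        raise ValueError("length must be divisible by 2**num_down")

    stages: List[Stage] = []
    ch, n = base_channels, length

    # Encoder path: the dimension doubles after each downsampling.
    for i in range(num_down):
        stages.append(Stage(f"enc{i}", ch, n))
        ch, n = ch * 2, n // 2

    stages.append(Stage("bottleneck", ch, n))

    # Decoder path: the dimension halves after each upsampling, so each
    # decoder stage matches the encoder stage on the same level.
    for i in reversed(range(num_down)):
        ch, n = ch // 2, n * 2
        stages.append(Stage(f"dec{i}", ch, n))

    return stages


if __name__ == "__main__":
    for stage in stage_plan(base_channels=64, length=32000):
        print(stage)
```

Because encoder and decoder stages on the same level end up with equal dimensions in this plan, a skip connection would only need a $1 \times 1$ projection when the way the two paths are combined (for example, concatenation) changes the channel count.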