Mamba in Speech: Towards an Alternative to Self-Attention

Xiangyu Zhang, Qiquan Zhang, Hexin Liu, Tianyi Xiao, Xinyuan Qian, Beena Ahmed, Eliathamby Ambikairajah, Haizhou Li, Julien Epps

TL;DR

This work assesses the viability of Selective State Space Model-based Mamba as a scalable alternative to self-attention in speech. It shows that bidirectional BiMamba designs, particularly ExtBiMamba, better capture global dependencies and semantic information for speech recognition while maintaining efficiency, outperforming MHSA-based baselines in several settings. Through targeted ablations, the authors demonstrate that introducing nonlinearity via FFN and residual connections is crucial for high-level information tasks, and they provide guidance on parameter choices and initialization. The results indicate that ConExtBiMamba can achieve or exceed state-of-the-art performance on ASR benchmarks and that Mamba approaches hold promise for broader speech tasks beyond enhancement and recognition.

Abstract

Transformer and its derivatives have achieved success in diverse tasks across computer vision, natural language processing, and speech processing. To reduce the complexity of computations within the multi-head self-attention mechanism in Transformer, Selective State Space Models (i.e., Mamba) were proposed as an alternative. Mamba exhibited its effectiveness in natural language processing and computer vision tasks, but its superiority has rarely been investigated in speech signal processing. This paper explores solutions for applying Mamba to speech processing by discussing two typical speech processing tasks: speech recognition, which requires semantic and sequential information, and speech enhancement, which focuses primarily on sequential patterns. The experimental results confirm that bidirectional Mamba (BiMamba) consistently outperforms vanilla Mamba, highlighting the advantages of a bidirectional design for speech processing. Moreover, experiments demonstrate the effectiveness of BiMamba as an alternative to the self-attention module in the Transformer model and its derivatives, particularly for the semantic-aware task. The crucial technologies for transferring Mamba to speech are then summarized in ablation studies and the discussion section, offering insights for extending this research to a broader scope of tasks.

Paper Structure

This paper contains 17 sections, 5 equations, 4 figures, 16 tables, 2 algorithms.

Figures (4)

  • Figure 1: Illustrations of (a) the inner bidirectional Mamba (InnBiMamba) from Vision Mamba [zhu2024vision], and (b) the external bidirectional Mamba (ExtBiMamba), where $\sigma$ denotes the SiLU activation.
  • Figure 2: Three applications of the Mamba layer in speech processing: (a) using stacked unidirectional/bidirectional Mamba layers as an alternative to Transformer layers; (b) replacing causal and non-causal MHSA in the Transformer layer with unidirectional/bidirectional Mamba, termed TransMamba and TransBiMamba; and (c) replacing MHSA in the Conformer layer with Mamba, termed ConMamba and ConBiMamba.
  • Figure 3: Training and dev set loss of Conformer and ConExtBiMamba on the AN4 dataset.
  • Figure 4: Decision boundaries for BiMamba and for BiMamba with a feed-forward network (FFN).
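
The external bidirectional idea named in Figure 1(b) can be illustrated with a small sketch. This is not the authors' implementation: `causal_scan` is a hypothetical toy recurrence standing in for a unidirectional Mamba layer (it has no selectivity, gating, or SiLU), and summing the two directions is an assumption made only for illustration. The point is to show why a second pass over the time-reversed sequence gives every frame access to future context.

```python
from typing import Callable, List

Layer = Callable[[List[float]], List[float]]


def causal_scan(x: List[float], decay: float = 0.5) -> List[float]:
    """Toy causal layer (hypothetical stand-in, NOT Mamba's selective scan).

    Each output depends only on the current and earlier inputs:
        h_t = decay * h_{t-1} + (1 - decay) * x_t
    """
    h = 0.0
    out = []
    for value in x:
        h = decay * h + (1.0 - decay) * value
        out.append(h)
    return out


def ext_bidirectional(x: List[float], forward_layer: Layer,
                      backward_layer: Layer) -> List[float]:
    """Wrap two causal layers into one bidirectional layer.

    The backward layer sees the time-reversed input, and its output is
    flipped back so both streams are aligned frame by frame.
    Assumption: the two streams are combined by summation.
    """
    fwd = forward_layer(x)
    bwd = backward_layer(x[::-1])[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


if __name__ == "__main__":
    a = [1.0, 0.0, 0.0, 0.0]
    b = [1.0, 0.0, 0.0, 5.0]  # differs from `a` only in the last frame

    # Unidirectional: the first frame cannot see the change at the end.
    print(causal_scan(a)[0], causal_scan(b)[0])

    # Bidirectional: the first frame now reflects the last frame as well.
    print(ext_bidirectional(a, causal_scan, causal_scan)[0],
          ext_bidirectional(b, causal_scan, causal_scan)[0])
```

In this sketch the two directions are separate callables, so in a real model they could carry separate parameters. That separation is what the word "external" is taken to mean here, in contrast to an inner design that shares part of the block between directions.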