
Exploring the Capability of Mamba in Speech Applications

Koichi Miyazaki, Yoshiki Masuyama, Masato Murata

TL;DR

Comparisons with state-of-the-art Transformer variants across various speech applications show that Mamba achieves comparable or better performance than Transformer-based models and is efficient in long-form speech processing.

Abstract

This paper explores the capability of Mamba, a recently proposed architecture based on state space models (SSMs), as a competitive alternative to Transformer-based models. In the speech domain, well-designed Transformer-based models, such as the Conformer and E-Branchformer, have become the de facto standards. Extensive evaluations have demonstrated the effectiveness of these Transformer-based models across a wide range of speech tasks. In contrast, the evaluation of SSMs has been limited to a few tasks, such as automatic speech recognition (ASR) and speech synthesis. In this paper, we compared Mamba with state-of-the-art Transformer variants for various speech applications, including ASR, text-to-speech, spoken language understanding, and speech summarization. Experimental evaluations revealed that Mamba achieves comparable or better performance than Transformer-based models, and demonstrated its efficiency in long-form speech processing.
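As background (not part of the paper's abstract): Mamba builds on a discretized linear state space model. A minimal sketch of the standard SSM recurrence, in the usual notation, is

$$
h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t,
$$

where $\bar{A}$ and $\bar{B}$ are discretizations of continuous-time parameters via a step size $\Delta$. Mamba's distinguishing choice is to make $B$, $C$, and $\Delta$ functions of the input $x_t$ (a "selective" SSM), so the recurrence can filter content-dependently while still scaling linearly with sequence length, which is what makes it attractive for long-form speech.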

Paper Structure

This paper contains 16 sections, 2 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Architecture of the Mamba block. (a) The original Mamba block. (b) The Mamba encoder block extends the original Mamba block with a bidirectional design, allowing it to capture both past and future context in the input sequence. (c) The Mamba decoder. To incorporate the encoder output, a cross-attention layer is applied after the original Mamba block (see the sketch after this list).
  • Figure 2: Long-form ASR results on TEDLIUM2.
  • Figure 3: Preference test results on CFS2 vs. MFS2.
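The sketch below illustrates the two designs described in Figure 1. It assumes the public `Mamba` block from the `mamba_ssm` package; the layer sizes, norm placement, and the summation used to fuse the two directions are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of Figure 1(b) and 1(c), assuming `pip install mamba-ssm`.
# Fusion by summation and pre-norm placement are assumptions for illustration.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumed dependency


class BiMambaEncoderBlock(nn.Module):
    """Bidirectional Mamba encoder block (Figure 1b): one Mamba pass over
    the sequence and one over its time reversal, so the block sees both
    past and future context."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.fwd = Mamba(d_model=d_model)
        self.bwd = Mamba(d_model=d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        h = self.norm(x)
        # Forward-time pass, plus a pass over the time-reversed sequence
        # that is un-reversed afterwards; summing the two is one possible fusion.
        out = self.fwd(h) + self.bwd(h.flip(dims=[1])).flip(dims=[1])
        return x + out  # residual connection


class MambaDecoderBlock(nn.Module):
    """Mamba decoder block (Figure 1c): a unidirectional Mamba block
    followed by cross-attention over the encoder output."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.mamba = Mamba(d_model=d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, y: torch.Tensor, enc: torch.Tensor) -> torch.Tensor:
        # y: decoder states (B, T, D); enc: encoder output (B, S, D)
        y = y + self.mamba(self.norm1(y))       # causal Mamba, residual
        attn, _ = self.cross_attn(self.norm2(y), enc, enc)
        return y + attn                         # bridge to encoder output
```

Running both directions roughly doubles the encoder's compute per layer, but each pass remains linear in sequence length, preserving the efficiency advantage over self-attention that the paper highlights for long-form speech.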