Mamba-based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition

Yoshiki Masuyama, Koichi Miyazaki, Masato Murata

TL;DR

This paper explores Mamba as a decoder-only architecture for ASR. It proposes a single decoder that takes speech tokens as a condition and predicts text tokens autoregressively, and this model significantly outperforms a non-selective SSM.

Abstract

Selective state space models (SSMs), represented by Mamba, have demonstrated computational efficiency and promising results in various tasks, including automatic speech recognition (ASR). Mamba has previously been applied to ASR within the attention-based encoder-decoder framework, where the cross-attention mechanism between encoder and decoder remains. This paper explores the capability of Mamba as a decoder-only architecture for ASR. Our MAmba-based DEcoder-ONly approach (MADEON) consists of a single decoder that takes speech tokens as a condition and predicts text tokens in an autoregressive manner. To enhance MADEON, we further propose speech prefixing, which performs bidirectional processing on speech tokens and thereby enriches the contextual information in the hidden states. Our experiments show that MADEON significantly outperforms a non-selective SSM. The combination of speech prefixing and the recently proposed Mamba-2 yields performance comparable to Transformer-based models on large datasets.
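
To make the decoder-only formulation concrete, the following is a minimal sketch of how one training example might be laid out: speech tokens act as the condition, a special token separates them from the text, and the model is trained to predict each next token. The function name, the `sos_id`/`eos_id` special tokens, and the choice to compute the loss only on text positions are illustrative assumptions, not details confirmed by the abstract.

```python
from typing import List, Tuple


def build_decoder_only_example(
    speech_tokens: List[int],
    text_tokens: List[int],
    sos_id: int,
    eos_id: int,
) -> Tuple[List[int], List[int], List[bool]]:
    """Lay out one example as: speech..., <sos>, text..., <eos>.

    All names here are hypothetical. The returned loss mask marks the
    positions whose target is a text token or <eos> (an assumption).
    """
    sequence = speech_tokens + [sos_id] + text_tokens + [eos_id]

    # Standard next-token setup: the target at step i is the token at i + 1.
    inputs = sequence[:-1]
    targets = sequence[1:]

    # Targets at indices >= len(speech_tokens) are text tokens or <eos>.
    n_speech = len(speech_tokens)
    loss_mask = [i >= n_speech for i in range(len(targets))]
    return inputs, targets, loss_mask


inputs, targets, loss_mask = build_decoder_only_example(
    speech_tokens=[10, 11, 12], text_tokens=[1, 2], sos_id=100, eos_id=101
)
print(inputs)     # [10, 11, 12, 100, 1, 2]
print(targets)    # [11, 12, 100, 1, 2, 101]
print(loss_mask)  # [False, False, False, True, True, True]
```

At inference time the same layout would be used without the text part: the decoder consumes the speech tokens and the separator, then generates text tokens one at a time until it emits the end token.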

Paper Structure

This paper contains 19 sections, 7 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of MADEON for the ASR task. The blue and red circles show the speech and text tokens obtained through subword modeling, respectively. The black circles are special tokens, and the gray dotted lines indicate the autoregressive text generation.
  • Figure 2: Architecture of (a) the original Mamba block and (b) the parallel Mamba-SP block. The selective SSM blocks used in the original Mamba and Mamba-2 are shown in (c) and (d), respectively. The symbol $\oslash$ indicates that a single vector is split into multiple vectors (see Mamba-2). STR denotes the speech token reversal, which is detailed in Fig. 3.
  • Figure 3: Illustration of the speech token reversal, which rearranges the features of speech tokens in reverse order. Features of speech and text tokens are colored in blue and red, respectively.
  • Figure 4: Illustration of normalized WERs of MADEON and MADEON-2 with and without speech prefixing across different word positions on LibriSpeech 100h.
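
The speech token reversal (STR) described in the Figure 3 caption can be sketched as follows. The reversal function follows the caption: it reverses only the speech-token features and leaves the text-token features in place. Everything else is an assumption made for illustration: the running sum is a toy stand-in for a causal recurrence (it is not an SSM), and the reverse-scan-reverse pattern is one plausible reading of how a causal block could see later speech frames.

```python
from itertools import accumulate
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def speech_token_reversal(features: Sequence[T], num_speech: int) -> List[T]:
    """Reverse the first `num_speech` items (speech) and keep the rest (text)."""
    if not 0 <= num_speech <= len(features):
        raise ValueError("num_speech must be between 0 and len(features)")
    speech = list(features[:num_speech])
    text = list(features[num_speech:])
    return speech[::-1] + text


def toy_causal_scan(features: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a causal recurrence: a running sum."""
    return list(accumulate(features))


# Three speech features followed by two text features (scalar toy values).
x = [1.0, 2.0, 3.0, 10.0, 20.0]
num_speech = 3

forward = toy_causal_scan(x)
# Reverse the speech part, scan causally, then restore the original order.
backward = speech_token_reversal(
    toy_causal_scan(speech_token_reversal(x, num_speech)), num_speech
)

print(forward)   # [1.0, 3.0, 6.0, 16.0, 36.0]
print(backward)  # [6.0, 5.0, 3.0, 16.0, 36.0]
```

In this toy setting, the first speech position in `backward` already summarizes every speech frame, while the text positions match `forward`, so text generation stays causal. Applying `speech_token_reversal` twice with the same `num_speech` returns the original order.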