Mamba for Streaming ASR Combined with Unimodal Aggregation

Ying Fang, Xiaofei Li

TL;DR

This paper tackles real-time streaming ASR by combining a linear-complexity state-space encoder (Mamba) with unimodal aggregation (UMA), which produces explicit token boundaries and token-level representations aggregated from feature frames. A convolutional lookahead layer and an optional early termination (ET) mechanism balance accuracy against latency, and the model is trained end-to-end with CTC. Experiments on AISHELL-1 and AISHELL-2 show that Mamba-UMA achieves competitive or superior character error rates at lower latency than Transformer- and Conformer-based baselines, with ET further lowering average latency. The work supports the feasibility of state-space models for streaming ASR and shows that UMA and lookahead can reduce latency without sacrificing recognition performance.

Abstract

This paper addresses streaming automatic speech recognition (ASR). Mamba, a recently proposed state space model, has demonstrated the ability to match or surpass Transformers in various tasks while benefiting from linear complexity. We explore the efficiency of the Mamba encoder for streaming ASR and propose an associated lookahead mechanism that leverages a controllable amount of future information. Additionally, a streaming-style unimodal aggregation (UMA) method is implemented, which automatically detects token activity and triggers token output in a streaming fashion, while aggregating feature frames to learn better token representations. Based on UMA, an early termination (ET) method is proposed to further reduce recognition latency. Experiments conducted on two Mandarin Chinese datasets demonstrate that the proposed model achieves competitive ASR performance in terms of both recognition accuracy and latency.
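To make the aggregation idea concrete, the following is a minimal sketch of streaming-style unimodal aggregation, not the authors' implementation. It assumes each encoder frame arrives with a positive scalar weight, that a token boundary is a local minimum ("valley") of the weights, and that the valley frame closes the current segment; these are illustrative choices, as are the names `streaming_uma` and `weighted_mean`. Deciding whether frame t is a valley needs the weight at t+1, so the sketch buffers one frame.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple


def weighted_mean(frames: List[Sequence[float]], weights: List[float]) -> List[float]:
    """Weighted average of the frames in one segment (hypothetical helper)."""
    total = sum(weights)
    if total == 0.0:
        raise ValueError("segment weights must not sum to zero")
    dim = len(frames[0])
    return [
        sum(w * f[d] for f, w in zip(frames, weights)) / total
        for d in range(dim)
    ]


def streaming_uma(
    stream: Iterable[Tuple[Sequence[float], float]]
) -> Iterator[List[float]]:
    """Yield one aggregated vector per detected token.

    Assumption: frame t is a valley when w[t] <= w[t-1] and w[t] <= w[t+1].
    The valley frame is treated as the last frame of the current segment.
    """
    seg_frames: List[Sequence[float]] = []
    seg_weights: List[float] = []
    prev_w = None   # weight of frame t-1
    pending = None  # frame t, held back until frame t+1 arrives

    for frame, w in stream:
        if pending is not None:
            cur_f, cur_w = pending
            seg_frames.append(cur_f)
            seg_weights.append(cur_w)
            if prev_w is not None and cur_w <= prev_w and cur_w <= w:
                # Valley found: emit the token as soon as it is complete.
                yield weighted_mean(seg_frames, seg_weights)
                seg_frames, seg_weights = [], []
            prev_w = cur_w
        pending = (frame, w)

    # End of utterance: flush the held-back frame and the last segment.
    if pending is not None:
        seg_frames.append(pending[0])
        seg_weights.append(pending[1])
    if seg_frames:
        yield weighted_mean(seg_frames, seg_weights)
```

For example, calling `list(streaming_uma(zip(frames, weights)))` with weights `[0.1, 0.8, 0.2, 0.9, 0.3]` detects a single valley at the third frame and therefore emits two token vectors. Because a token is emitted as soon as its closing valley is confirmed, output latency in this sketch is tied to the weight pattern rather than to a fixed chunk size.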

Paper Structure (13 sections, 2 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Model architecture. $\sigma$ denotes an activation layer. Residual connections and normalization are omitted in the encoder and decoder.
  • Figure 2: An example of streaming UMA. The spectrogram region and UMA weights marked by the solid box and line correspond to the same character. The blue and red arrows mark a UMA valley and peak, respectively.
  • Figure 3: Experimental results of the lookahead mechanism and ET method with Mamba-UMA on AISHELL-1.