Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models

Xiaoxue Gao, Nancy F. Chen

TL;DR

Speech-Mamba addresses the challenge of long-context speech recognition by marrying Selective State Space Models (SSMs) with Transformer architectures to achieve near-linear scalability for long sequences. The framework employs a Mamba encoder and decoder within a joint CTC/S2S objective, leveraging RMSNorm and selective SSMs to capture global speech-text dependencies while Transformer components model local temporal structure. Empirical results on LibriSpeech show that Speech-Mamba outperforms Transformer baselines and state-of-the-art models on long-context subsets, with substantial relative improvements and fewer parameters. The approach demonstrates the viability and practicality of long-context ASR through an end-to-end Mamba-augmented architecture, promising advances for large-scale, long-form speech understanding. The work contributes a concrete multi-objective training paradigm, detailed architectural design, and a rigorous evaluation on long-context data, highlighting its potential as a foundation for next-generation speech technology.
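A joint CTC/S2S objective of this kind is commonly written as a weighted sum of the two losses. The form below is a sketch under that assumption. The interpolation weight \lambda and the symbol names are illustrative and are not taken from the paper.

```latex
% Assumed hybrid objective: \lambda \in [0, 1] balances the CTC loss
% against the sequence-to-sequence (S2S) loss.
\mathcal{L}_{\text{joint}} = \lambda \, \mathcal{L}_{\text{CTC}} + (1 - \lambda) \, \mathcal{L}_{\text{S2S}}
```

Under this form, \lambda = 1 reduces training to pure CTC and \lambda = 0 to pure S2S; intermediate values trade monotonic alignment against the flexibility of the decoder.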

Abstract

Current automatic speech recognition systems struggle to model long speech sequences because of the quadratic complexity of Transformer-based models. Selective state space models such as Mamba have performed well on long-sequence modeling in natural language processing and computer vision tasks. However, their application to speech technology tasks remains under-explored. We propose Speech-Mamba, which incorporates selective state space modeling into Transformer neural architectures. Long-sequence representations from the selective state space models in Speech-Mamba are complemented by lower-level representations from Transformer-based modeling. Speech-Mamba achieves better capacity to model long-range dependencies, as it scales near-linearly with sequence length.
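To make the near-linear scaling concrete, here is a minimal pure-Python sketch of a single-channel selective state space recurrence. It is not the Speech-Mamba implementation: the function name `selective_scan`, the scalar weights `w_b`, `w_c`, `w_delta`, and the simplified discretisation are all assumptions chosen for illustration. The point it shows is that each time step performs one fixed-size state update, so total work grows with the sequence length T rather than with T squared, as in self-attention.

```python
import math


def softplus(z):
    # Numerically stable softplus; keeps the step size positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, a, w_b, w_c, w_delta, b_delta=0.0):
    """Run a single-channel selective SSM over the sequence xs.

    a        : negative diagonal entries of the state matrix (length N)
    w_b, w_c : hypothetical weights making B_t and C_t depend on x_t
    w_delta  : hypothetical weight making the step size depend on x_t
    """
    n = len(a)
    h = [0.0] * n          # hidden state, fixed size regardless of len(xs)
    ys = []
    for x in xs:           # one pass over the sequence: O(T * N) work
        delta = softplus(w_delta * x + b_delta)   # input-dependent step size
        y = 0.0
        for i in range(n):
            a_bar = math.exp(delta * a[i])        # discretised decay in (0, 1)
            b_t = w_b[i] * x                      # input-dependent B_t (assumed form)
            c_t = w_c[i] * x                      # input-dependent C_t (assumed form)
            h[i] = a_bar * h[i] + delta * b_t * x # selective state update
            y += c_t * h[i]                       # read-out
        ys.append(y)
    return ys


if __name__ == "__main__":
    out = selective_scan([0.5, -1.0, 2.0], a=[-1.0, -0.5],
                         w_b=[1.0, 0.3], w_c=[0.7, 1.0], w_delta=1.0)
    print(len(out))
```

Doubling the input length doubles the number of state updates in this loop, whereas the attention score matrix of a Transformer layer would grow fourfold. Practical Mamba implementations replace the Python loop with a hardware-aware parallel scan, which this sketch does not attempt.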

Paper Structure (19 sections, 1 equation, 1 figure, 5 tables)