TC-BiMamba: Trans-Chunk bidirectionally within BiMamba for unified streaming and non-streaming ASR

Qingshun She, Jing Peng, Yangui Fang, Yu Xi, Kai Yu

TL;DR

TC-BiMamba tackles the problem of unifying streaming and offline ASR in a single model by introducing a BiMamba-based encoder enhanced with CNNs and a Hybrid BiTransformer decoder. The key innovation, Trans-Chunk, trains bidirectional BiMamba offline with dynamic chunk sizes, flipping the backward sequence across the full input to provide complete historical context while maintaining training efficiency. The model is trained with a joint CTC-E2E objective and two-pass decoding, achieving superior or competitive results compared to U2++ and LC-BiMamba while reducing training overhead and memory usage. Experiments on AISHELL-1/2 and LibriSpeech demonstrate strong performance gains in both offline and streaming modes and substantial training efficiency improvements, highlighting the practical impact of dynamic chunk-size training for unified ASR systems.

Abstract

This work investigates bidirectional Mamba (BiMamba) for unified streaming and non-streaming automatic speech recognition (ASR). Dynamic chunk size training enables a single model to support both offline decoding and streaming decoding under various latency settings. In contrast, the existing BiMamba-based streaming method is limited to decoding with a fixed chunk size. When dynamic chunk size training is applied, training overhead increases substantially. To tackle this issue, we propose Trans-Chunk BiMamba (TC-BiMamba) for dynamic chunk size training. The Trans-Chunk mechanism trains both directional sequences in an offline style with dynamic chunk sizes. Compared to traditional chunk-wise processing, TC-BiMamba achieves a 1.3 times training speedup, reduces training memory by 50%, and improves model performance because it can capture bidirectional context. Experimental results further show that TC-BiMamba outperforms U2++ and matches LC-BiMamba with a smaller model size.
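
To make "dynamic chunk size training" concrete, the following is a minimal sketch of how a chunk size might be drawn for each training batch so that one model sees both full-context (offline) and limited-context (streaming) conditions. The function name `sample_chunk_size`, the parameters `max_chunk` and `full_context_prob`, and their default values are illustrative assumptions; the abstract does not specify the sampling scheme.

```python
import random
from typing import Optional


def sample_chunk_size(
    num_frames: int,
    max_chunk: int = 25,             # assumed upper bound for streaming chunks
    full_context_prob: float = 0.5,  # assumed share of full-context batches
    rng: Optional[random.Random] = None,
) -> int:
    """Draw a chunk size (in frames) for one training batch.

    Hypothetical helper: with probability `full_context_prob` the whole
    utterance is treated as a single chunk (offline condition); otherwise
    a small chunk size is drawn uniformly (streaming condition).
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    rng = rng or random.Random()
    if rng.random() < full_context_prob:
        return num_frames
    # Never return a chunk longer than the utterance itself.
    return min(num_frames, rng.randint(1, max_chunk))
```

At inference time, the same model could then be decoded with whichever chunk size matches the desired latency, with the full utterance length corresponding to offline decoding.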

Paper Structure (15 sections, 5 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Assuming a chunk size of 4, the upper part shows an example of our Trans-Chunk mechanism and the lower part shows an example of traditional chunk-wise processing. The Trans-Chunk mechanism trains both directional sequences in their entirety, whereas traditional chunk-wise processing segments the backward-direction batch into mini-batches, which leads to slower training, higher GPU memory utilization, and worse performance.
  • Figure 2: Overall architecture of TC-BiMamba. Positional embedding is not employed.
  • Figure 3: Training overhead comparison between TC-BiMamba(L) and TC-BiMamba(L)-CS on LibriSpeech.
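
The contrast described in the Figure 1 caption can be sketched on frame indices alone. This is one reading of the caption, not the authors' implementation: it assumes that the backward-direction input reverses frames inside each chunk, and that Trans-Chunk keeps those flipped chunks in one full-length sequence while traditional chunk-wise processing turns each flipped chunk into a separate mini-batch item. Both function names are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def flip_within_chunks(frames: Sequence[T], chunk_size: int) -> List[T]:
    """Assumed Trans-Chunk-style backward input.

    Frames are reversed inside each chunk, but the chunks stay in their
    original order, so the result is still ONE full-length sequence that
    a backward scan can process in a single pass.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    out: List[T] = []
    for start in range(0, len(frames), chunk_size):
        out.extend(reversed(frames[start:start + chunk_size]))
    return out


def split_into_flipped_chunks(frames: Sequence[T], chunk_size: int) -> List[List[T]]:
    """Assumed traditional chunk-wise backward input.

    Each flipped chunk becomes its own short sequence (a mini-batch item),
    so the backward scan runs once per chunk with no shared context.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        list(reversed(frames[start:start + chunk_size]))
        for start in range(0, len(frames), chunk_size)
    ]


if __name__ == "__main__":
    frames = list(range(1, 9))  # eight frames, chunk size 4 as in Figure 1
    print(flip_within_chunks(frames, 4))
    print(split_into_flipped_chunks(frames, 4))
```

Under these assumptions the first function yields a single sequence of the original length, while the second yields one short sequence per chunk, which is consistent with the caption's point that segmenting the backward direction multiplies the number of mini-batches to process.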