TC-BiMamba: Trans-Chunk bidirectionally within BiMamba for unified streaming and non-streaming ASR
Qingshun She, Jing Peng, Yangui Fang, Yu Xi, Kai Yu
TL;DR
TC-BiMamba tackles the problem of unifying streaming and offline ASR in a single model by introducing a BiMamba-based encoder enhanced with CNNs and a Hybrid BiTransformer decoder. The key innovation, Trans-Chunk, trains bidirectional BiMamba offline with dynamic chunk sizes, flipping the backward sequence across the full input to provide complete historical context while maintaining training efficiency. The model is trained with a joint CTC-E2E objective and two-pass decoding, achieving superior or competitive results compared to U2++ and LC-BiMamba while reducing training overhead and memory usage. Experiments on AISHELL-1/2 and LibriSpeech demonstrate strong performance gains in both offline and streaming modes and substantial training efficiency improvements, highlighting the practical impact of dynamic chunk-size training for unified ASR systems.
Abstract
This work investigates bidirectional Mamba (BiMamba) for unified streaming and non-streaming automatic speech recognition (ASR). Dynamic chunk size training enables a single model to support both offline decoding and streaming decoding under various latency settings. In contrast, existing BiMamba-based streaming methods are limited to fixed chunk size decoding, and applying dynamic chunk size training to them increases training overhead substantially. To tackle this issue, we propose Trans-Chunk BiMamba (TC-BiMamba) for dynamic chunk size training. The Trans-Chunk mechanism trains both the forward and backward sequences in an offline style with dynamic chunk size. On the one hand, compared to traditional chunk-wise processing, TC-BiMamba simultaneously achieves a 1.3 times training speedup, reduces training memory by 50%, and improves model performance, since it can capture bidirectional context. On the other hand, experimental results show that TC-BiMamba outperforms U2++ and matches LC-BiMamba with a smaller model size.
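To make "dynamic chunk size training" concrete, the following is a minimal sketch of how a chunk size might be sampled per training batch and how an utterance is then split into chunks. It illustrates the general idea only and is not the Trans-Chunk mechanism itself. The function names, the 50% full-context probability, and the maximum chunk size of 25 frames are assumptions chosen for illustration and are not taken from this work.

```python
import random


def sample_chunk_size(num_frames, max_chunk=25, full_prob=0.5, rng=random):
    """Pick a chunk size for one training batch (hypothetical helper).

    Assumes num_frames >= 1. With probability `full_prob` the whole
    utterance is used as a single chunk (offline mode); otherwise a
    streaming chunk size is drawn uniformly from [1, max_chunk] and
    capped at the utterance length.
    """
    if rng.random() < full_prob:
        return num_frames
    return min(rng.randint(1, max_chunk), num_frames)


def chunk_boundaries(num_frames, chunk_size):
    """Split frame indices [0, num_frames) into consecutive (start, end) chunks."""
    return [
        (start, min(start + chunk_size, num_frames))
        for start in range(0, num_frames, chunk_size)
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs sample the same sizes
    num_frames = 40
    for _ in range(3):
        size = sample_chunk_size(num_frames, rng=rng)
        print(size, chunk_boundaries(num_frames, size))
```

Because a different chunk size is drawn for each batch, one set of weights is exposed to both full-utterance and short-chunk conditions, which is what lets a single model be decoded offline or at several streaming latencies.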
