Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection
Haoyu Wang, Guoqiang Hu, Guodong Lin, Wei-Qiang Zhang, Jian Li
TL;DR
Simul-Whisper adapts the Whisper model to streaming ASR without any fine-tuning by using the temporal alignment embedded in its cross-attention to guide autoregressive decoding. It adds an Integrate-and-Fire (IF) truncation detection module that combines encoder and decoder cues to detect words truncated at chunk boundaries, so that their unreliable transcriptions can be removed. Experiments on LibriSpeech and Multilingual LibriSpeech show an average absolute word error rate degradation of only 1.46% at 1-second chunks and favorable latency compared with the Local Agreement baseline, with consistent gains from the truncation detector across languages and Whisper architectures. The approach is a practical, computation-conscious streaming solution for large pre-trained encoder-decoder models, although padding introduces additional latency and techniques such as self-distillation leave room for further improvement.
Abstract
As a robust, large-scale multilingual speech recognition model, Whisper has demonstrated impressive results in many low-resource and out-of-distribution scenarios. However, its encoder-decoder structure hinders its application to streaming speech recognition. In this paper, we introduce Simul-Whisper, which uses the time alignment embedded in Whisper's cross-attention to guide auto-regressive decoding and achieve chunk-based streaming ASR without any fine-tuning of the pre-trained model. Furthermore, we observe that words truncated at chunk boundaries degrade the decoding results, and we propose an integrate-and-fire-based truncation detection model to address this issue. Experiments on multiple languages and Whisper architectures show that Simul-Whisper achieves an average absolute word error rate degradation of only 1.46% at a chunk size of 1 second, significantly outperforming the current state-of-the-art baseline.
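To make the two mechanisms named above more concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes one plausible reading of attention-guided decoding (stop emitting tokens once the newest token's cross-attention peaks in the last few frames of the chunk) and the generic integrate-and-fire rule (accumulate per-frame weights and fire when the sum reaches a threshold). All names here (`should_stop_decoding`, `guard_frames`, `integrate_and_fire`, `ends_with_truncated_word`, `residual_tol`) are hypothetical, and the per-frame weights, which in practice would come from a trained module, are supplied by hand.

```python
from typing import List, Sequence, Tuple


def should_stop_decoding(attn_row: Sequence[float], guard_frames: int) -> bool:
    """Assumed stopping rule: halt when the newest token attends to the chunk's tail.

    attn_row holds the newest token's cross-attention weights over the
    encoder frames of the audio received so far.
    """
    peak = max(range(len(attn_row)), key=attn_row.__getitem__)
    return peak >= len(attn_row) - guard_frames


def integrate_and_fire(
    weights: Sequence[float], threshold: float = 1.0
) -> Tuple[List[int], float]:
    """Accumulate per-frame weights (assumed to lie in [0, 1]).

    A firing is recorded each time the running sum reaches the threshold,
    after which the threshold is subtracted. Returns the firing frame
    indices and the residual left after the final frame.
    """
    fired_at: List[int] = []
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if acc >= threshold:
            fired_at.append(i)
            acc -= threshold
    return fired_at, acc


def ends_with_truncated_word(
    weights: Sequence[float], threshold: float = 1.0, residual_tol: float = 0.5
) -> bool:
    """Hypothetical heuristic: a large residual suggests an unfinished word."""
    _, residual = integrate_and_fire(weights, threshold)
    return residual >= residual_tol


# Attention peaks mid-chunk: keep decoding. Peaks on the last frame: stop.
keep_going = should_stop_decoding([0.1, 0.7, 0.1, 0.1], guard_frames=1)
stop_now = should_stop_decoding([0.1, 0.1, 0.1, 0.7], guard_frames=1)

# Two complete firings with nothing left over, versus a chunk that ends
# part-way through accumulating the next firing.
complete = integrate_and_fire([0.5, 0.5, 0.25, 0.75])
truncated = ends_with_truncated_word([0.5, 0.5, 0.75])
```

In this sketch, a streaming loop would call `should_stop_decoding` after every emitted token and would drop the final token of a chunk whenever `ends_with_truncated_word` returns true; both thresholds are illustrative values rather than settings reported in the paper.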
