Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection

Haoyu Wang, Guoqiang Hu, Guodong Lin, Wei-Qiang Zhang, Jian Li

TL;DR

Simul-Whisper addresses the challenge of streaming ASR with the Whisper model by leveraging cross-attention-based temporal alignment to guide autoregressive decoding, without any fine-tuning. It introduces an Integrate-and-Fire (IF) truncation detection module to remove unreliable truncations at chunk boundaries, combining encoder and decoder cues for robust streaming inference. Empirical results on LibriSpeech and Multilingual LibriSpeech demonstrate low degradation in word error rate at 1-second chunks and favorable latency compared to the Local Agreement baseline, with consistent gains from the truncation detector across languages and architectures. The approach offers a practical, computation-conscious streaming solution for large pre-trained encoder-decoder models, albeit with padding-induced latency and room for further improvement through techniques like self-distillation.

Abstract

As a robust and large-scale multilingual speech recognition model, Whisper has demonstrated impressive results in many low-resource and out-of-distribution scenarios. However, its encoder-decoder structure hinders its application to streaming speech recognition. In this paper, we introduce Simul-Whisper, which uses the time alignment embedded in Whisper's cross-attention to guide auto-regressive decoding and achieve chunk-based streaming ASR without any fine-tuning of the pre-trained model. Furthermore, we observe the negative effect of the truncated words at the chunk boundaries on the decoding results and propose an integrate-and-fire-based truncation detection model to address this issue. Experiments on multiple languages and Whisper architectures show that Simul-Whisper achieves an average absolute word error rate degradation of only 1.46% at a chunk size of 1 second, which significantly outperforms the current state-of-the-art baseline.
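The attention-guided stopping rule described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a hypothetical cross-attention matrix `attn[t][f]` (weight of token `t` on audio frame `f`, already aggregated over heads) and a hypothetical parameter `boundary_frames` that defines how many trailing frames count as the chunk boundary. It also assumes the token that triggers the stop is withheld rather than emitted.

```python
from typing import Sequence


def most_attended_frame(attn_row: Sequence[float]) -> int:
    """Return the index of the audio frame with the highest attention weight."""
    return max(range(len(attn_row)), key=lambda i: attn_row[i])


def tokens_to_emit(attn: Sequence[Sequence[float]], boundary_frames: int = 2) -> int:
    """Return how many leading tokens to keep for the current chunk.

    Decoding stops at the first token whose most attended frame lies in the
    last `boundary_frames` frames of the chunk (assumed definition of
    "at the chunk boundary"). That token and everything after it are held
    back until more audio arrives.
    """
    if not attn:
        return 0
    n_frames = len(attn[0])
    for t, row in enumerate(attn):
        if most_attended_frame(row) >= n_frames - boundary_frames:
            return t  # stop before this token
    return len(attn)  # no token reached the boundary; keep all of them


# Toy attention matrix: 3 tokens over 6 audio frames (illustrative values only).
toy_attn = [
    [0.70, 0.20, 0.05, 0.05, 0.00, 0.00],  # token 0 attends to frame 0
    [0.10, 0.10, 0.60, 0.10, 0.10, 0.00],  # token 1 attends to frame 2
    [0.00, 0.00, 0.00, 0.10, 0.20, 0.70],  # token 2 attends to the last frame
]
kept = tokens_to_emit(toy_attn, boundary_frames=2)
```

In this toy case the third token attends most to the final frame, so only the first two tokens would be kept for the chunk.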

Paper Structure

This paper contains 12 sections, 5 equations, 2 figures, 1 table, 1 algorithm.

Figures (2)

  • Figure 1: An overview of our method. Different colours indicate different audio chunks and their corresponding transcriptions. Darker blocks in the cross-attention matrix indicate the audio frames most attended to by the current token. Upper part: Decoding is stopped when the most attended audio frame appears at the chunk boundary. Lower part: The unreliable last word is deleted from the transcription when truncation is detected and the model waits until the next chunk is received.
  • Figure 2: The WERs at different DALs for Whisper Large-v2, where computation-aware latency takes processing time into account and computation-unaware latency does not. Chunk lengths vary from 0.5 to 1.0 seconds. For the same latency, the proposed method exhibits a significantly lower WER compared to the baseline, and the addition of IF further improves performance.
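
The lower part of Figure 1 (dropping an unreliable last word when truncation is detected) can be sketched with a generic integrate-and-fire accumulator. The page does not spell out the detector's internals, so everything here is an assumption made for illustration: the per-frame weights `alphas`, the firing threshold of 1.0, the `residual_limit` parameter, and the rule that a large leftover accumulation at the end of the chunk signals a word cut off by the boundary.

```python
from typing import List, Sequence, Tuple


def integrate_and_fire(alphas: Sequence[float], threshold: float = 1.0) -> Tuple[List[int], float]:
    """Accumulate per-frame weights and fire whenever the sum reaches `threshold`.

    Returns the frame indices where a firing occurred and the residual
    accumulation left over at the end of the chunk.
    """
    acc = 0.0
    fired: List[int] = []
    for i, a in enumerate(alphas):
        acc += a
        if acc >= threshold:
            fired.append(i)
            acc -= threshold  # carry the remainder into the next unit
    return fired, acc


def drop_truncated_word(words: Sequence[str], alphas: Sequence[float],
                        residual_limit: float = 0.3) -> List[str]:
    """Remove the last word if the chunk appears to end mid-word.

    Assumed rule: a residual above `residual_limit` means a unit was still
    being integrated when the chunk ended, so the last word is unreliable.
    """
    _, residual = integrate_and_fire(alphas)
    if words and residual > residual_limit:
        return list(words[:-1])
    return list(words)


# Illustrative weights only: two complete units, then half of a third.
fired, residual = integrate_and_fire([0.5, 0.5, 0.25, 0.75, 0.5])
cleaned = drop_truncated_word(["hello", "world", "ag"], [0.5, 0.5, 0.25, 0.75, 0.5])
```

With these toy weights the accumulator fires twice and ends with half a unit pending, so the last word is withheld until the next chunk arrives.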