
WhisperRT -- Turning Whisper into a Causal Streaming Model

Tomer Krichli, Bhiksha Raj, Joseph Keshet

Abstract

Automatic Speech Recognition (ASR) has seen remarkable progress, with models like OpenAI Whisper and NVIDIA Canary achieving state-of-the-art (SOTA) performance in offline transcription. However, these models are not designed for streaming (online or real-time) transcription, due to limitations in their architecture and training methodology. We propose a method to turn the transformer encoder-decoder model into a low-latency streaming model. The encoder is made causal so that it processes audio incrementally, while the decoder conditions on partial encoder states to generate tokens aligned with the available temporal context. This requires explicit synchronization between encoded input frames and token emissions. Since tokens are produced only after sufficient acoustic evidence is observed, an inherent latency arises, necessitating fine-tuning of the encoder-decoder alignment mechanism. We propose an updated inference mechanism that uses the fine-tuned causal encoder and decoder for greedy and beam-search decoding, and we show that it is locally optimal. Experiments on low-latency chunk sizes (less than 300 msec) show that our fine-tuned model outperforms existing non-fine-tuned streaming approaches in most cases, at lower complexity. We release our training and inference code, along with the fine-tuned models, to support further research and development in streaming ASR.

Paper Structure

This paper contains 31 sections, 2 theorems, 43 equations, 7 figures, 6 tables, 3 algorithms.

Key Result

Theorem 1

Let $T$ be the input sequence length to the encoder, $d$ the embedding dimension, and $\tau$ the chunk size, with $0 < \tau \ll T$. The computation of blocked causal attention over the full sequence during streaming requires $\mathcal{O}(T^2d + Td^2)$ operations and $\mathcal{O}(Td)$ additional memo

Figures (7)

  • Figure 1: Encoder causal mask example with $\tau=15$, $\tau_0=30$, and $k=10$ chunks. This mask implies that the model waits $600$ msec for the first buffer before feeding input to the encoder; after that, input is fed every $300$ msec. Purple regions contain zeros, while white regions contain $-\infty$. The entry at index (35,50) is marked with a green point.
  • Figure 2: The inference process, using a chunk size of $\tau$, initial chunk of size $\tau_0=\tau$. The figure also illustrates how attention weight matrices are computed, specifically, within the encoder's self-attention and the decoder's cross-attention mechanisms.
  • Figure 3: Token distribution for the Whisper model (left) and WhisperRT (right) for the third token over time, conditioned on the ground truth prefix 'she had.' The color bar on the right indicates the confidence scale for both models, with red regions representing higher confidence values. The full utterance is: "she had your dark suit in greasy wash water all year." The plots show the probability distribution of the third token, rather than the sequence of predicted tokens. As such, EOT is the predicted token until there is sufficient acoustic evidence to predict 'your.' Hence, the token 'had' does not appear before the token 'your.'
  • Figure 4: Illustration of the fine-tuning process. The example depicts an encoder operating with a chunk size of $300$ msec. This approach improves training efficiency by avoiding the need to process every possible frame in the streaming setting.
  • Figure 5: ARWER vs. chunk size per method, on large-v2 models. The left subfigure shows results on LibriSpeech test-clean; the right subfigure shows results on LibriSpeech test-other.
  • ...and 2 more figures
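
The mask described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. This is not the authors' released code: the function name `blocked_causal_mask` and its parameters are hypothetical, and the sketch assumes that the $k$ chunks of size $\tau$ follow the initial chunk of size $\tau_0$, and that each frame may attend to every frame up to the end of its own chunk. The 20 msec frame duration is inferred from the caption ($\tau_0=30$ frames corresponding to $600$ msec).

```python
import numpy as np

FRAME_MS = 20  # inferred from the caption: 30 frames <-> 600 msec


def blocked_causal_mask(tau0: int, tau: int, num_chunks: int) -> np.ndarray:
    """Additive attention mask: 0 where attention is allowed, -inf elsewhere.

    Assumption (not confirmed by the source): `num_chunks` chunks of size
    `tau` follow one initial chunk of size `tau0`, and a frame may attend
    to all frames up to the end of the chunk that contains it.
    """
    sizes = [tau0] + [tau] * num_chunks
    ends = np.cumsum(sizes)              # exclusive end index of each chunk
    total = int(ends[-1])
    # For every frame (row), the exclusive end index of its own chunk.
    chunk_end = np.repeat(ends, sizes)
    cols = np.arange(total)
    allowed = cols[None, :] < chunk_end[:, None]
    return np.where(allowed, 0.0, -np.inf)


mask = blocked_causal_mask(tau0=30, tau=15, num_chunks=10)
first_wait_ms = 30 * FRAME_MS   # wait before the first encoder call
step_ms = 15 * FRAME_MS         # interval between later encoder calls
```

Under these assumptions, frame 35 lies in the chunk covering frames 30 to 44 and frame 50 lies in the next chunk, so `mask[35, 50]` is $-\infty$ while `mask[35, 44]` is zero. Such a mask would typically be added to the attention logits before the softmax, so masked positions receive zero attention weight.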

Theorems & Definitions (11)

  • Definition 1: Stable token for greedy decoding
  • Claim 1: Streaming greedy decoding optimality
  • Definition 2: Top-$k$ operator
  • Definition 3: Stable token for beam search decoding
  • Theorem 1
  • proof : Proof of Property 1
  • proof : Proof of Property 2
  • Claim 2: Streaming Greedy Decoding Optimality
  • proof : Proof of Claim 1
  • Theorem 2
  • ...and 1 more