Whispy: Adapting STT Whisper Models to Real-Time Environments

Antonio Bevilacqua, Paolo Saviano, Alessandro Amirante, Simon Pietro Romano

TL;DR

Whispy tackles the challenge of turning Whisper's offline ASR capabilities into a real-time streaming system by processing short audio chunks in a shifting buffer and aligning overlapping transcripts with a Levenshtein-based agreement. It integrates faster-whisper with Silero VAD, an RTP-based input pipeline, and a data-register-driven transcriber to reduce latency while maintaining accuracy, including a lightweight hallucination filter. Evaluations on diverse benchmarks show Whispy achieves transcription quality close to offline Whisper (within 1–2% WER on most datasets) with latency in the sub-second to around 1.5 s range depending on model size and chunk settings, demonstrating practical viability for real-time conferencing and streaming. The work also outlines future directions, such as diarization, summarization, and multimodal extensions, and plans to release production data to advance ASR research.
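To make the idea of aligning overlapping transcripts more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that two consecutive chunk transcripts share some words at their boundary, and it uses a word-level Levenshtein distance to find the overlap that agrees best. The function names, the `max_ratio` threshold, and the choice to keep the earlier chunk's wording inside the overlap are all illustrative assumptions.

```python
from typing import List, Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two token sequences (each edit costs 1)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def merge_overlapping(previous: List[str], current: List[str],
                      max_ratio: float = 0.34) -> List[str]:
    """Append `current` to `previous`, dropping the words they share.

    Hypothetical helper: tries every overlap length k, compares the last k
    words of `previous` with the first k words of `current`, and keeps the
    k with the lowest distance-per-word, provided it is <= max_ratio.
    """
    best_k, best_ratio = 0, max_ratio
    for k in range(1, min(len(previous), len(current)) + 1):
        ratio = levenshtein(previous[-k:], current[:k]) / k
        if ratio <= best_ratio:  # on ties, the longer overlap wins
            best_k, best_ratio = k, ratio
    # Assumption: inside the overlap, the earlier chunk's wording is kept.
    return previous + current[best_k:]


if __name__ == "__main__":
    first = "we will meet at noon".split()
    second = "meat at noon tomorrow".split()  # one misrecognised word
    print(" ".join(merge_overlapping(first, second)))
```

In this sketch a single disagreeing word inside a three-word overlap is tolerated, so the two chunks are still stitched together; with no plausible overlap, the chunks are simply concatenated.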

Abstract

Large general-purpose transformer models have recently become the mainstay of speech analysis. In particular, Whisper achieves state-of-the-art results in tasks such as speech recognition, translation, language identification, and voice activity detection. However, Whisper models are not designed for real-time use, and this limitation makes them unsuitable for many practical applications. In this paper, we introduce Whispy, a system intended to bring live capabilities to the pretrained Whisper models. Thanks to a number of architectural optimisations, Whispy can consume live audio streams and generate high-level, coherent voice transcriptions while maintaining a low computational cost. We evaluate the performance of our system on a large repository of publicly available speech datasets, investigating how the transcription mechanism introduced by Whispy affects the Whisper output. Experimental results show that Whispy excels in robustness, promptness, and accuracy.

Paper Structure (15 sections, 6 figures, 1 table, 1 algorithm)

Figures (6)

  • Figure 1: General Whispy service architecture. The overall system lives within an HTTP server, which we use as an interface to set up the incoming stream, define the transcription destination, and update options such as model size or Voice Activity Detection (VAD) parameters.
  • Figure 2: Practical example of the Whispy suggestion mechanism.
  • Figure 3: Across all the tested datasets, excluding rev16, each instance of our Whispy implementation performs within a 1-2% negative difference from its corresponding offline Whisper version. Each box in the graph represents the distribution of the Word Error Rate scored by the labeled model.
  • Figure 4: Diagram of the critical differences among the tested models. The continuous bold line suggests there are no statistically significant differences in the obtained results, despite offline Whisper models performing, on average, better than real-time Whispy models.
  • Figure 5: Pairwise comparison of offline Whisper transcription WER (x-axis) against real-time Whispy transcription WER (y-axis). Data points above the quadrant bisector represent audio clips for which Whispy performed worse than Whisper (higher WER), while the region below the quadrant bisector contains all data points for which Whispy performed better than Whisper (lower WER).
  • ...and 1 more figures
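
The captions for Figures 3 and 5 rely on Word Error Rate (WER). As a reminder of how that metric is usually defined, the sketch below computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and labels a clip the way Figure 5 is read. The text normalisation (lowercasing and whitespace splitting) and the function names are assumptions made for illustration; the normalisation used in the paper's evaluation is not described in this section.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()   # assumed normalisation: lowercase only
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    row = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        new_row = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            new_row.append(min(row[j] + 1,          # deletion
                               new_row[j - 1] + 1,  # insertion
                               row[j - 1] + cost))  # substitution or match
        row = new_row
    return row[-1] / len(ref)


def side_of_bisector(whisper_wer: float, whispy_wer: float) -> str:
    """Hypothetical helper: where a clip would fall in a Figure 5 style plot."""
    if whispy_wer > whisper_wer:
        return "above"   # Whispy worse (higher WER)
    if whispy_wer < whisper_wer:
        return "below"   # Whispy better (lower WER)
    return "on"


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    offline = word_error_rate(reference, "the cat sat on the mat")
    streaming = word_error_rate(reference, "the cat sat on mat")
    print(offline, streaming, side_of_bisector(offline, streaming))
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, which is one reason hallucinated output is so costly for this metric.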