WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

Max Bain, Jaesung Huh, Tengda Han, Andrew Zisserman

TL;DR

WhisperX tackles long-form transcription with three components: a VAD-based pre-segmentation stage, a Cut & Merge strategy that produces ~30-second chunks, and external forced phoneme alignment that yields precise word-level timestamps. This design enables parallel, batched transcription with Whisper while maintaining time alignment, and it reduces the boundary errors, drift, and hallucinations inherent to sliding-window approaches. Evaluation across AMI, SWB, TED-LIUM, and Kincaid46 shows state-of-the-art word segmentation and long-form transcription, plus a twelve-fold transcription speedup from batched inference. The work also explores multilingual transcription and translation modes, demonstrating that external phoneme alignment provides more reliable timestamps than Whisper-alone approaches.

Abstract

Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, their application to long audio transcription via buffered or sliding window approaches is prone to drifting, hallucination, and repetition, and prohibits batched transcription because these approaches are sequential. Further, the timestamps corresponding to each utterance are prone to inaccuracies, and word-level timestamps are not available out-of-the-box. To overcome these challenges, we present WhisperX, a time-accurate speech recognition system with word-level timestamps utilising voice activity detection and forced phoneme alignment. In doing so, we demonstrate state-of-the-art performance on long-form transcription and word segmentation benchmarks. Additionally, we show that pre-segmenting audio with our proposed VAD Cut & Merge strategy improves transcription quality and enables a twelve-fold transcription speedup via batched inference.

Paper Structure (19 sections, 1 figure, 4 tables)

This paper contains 19 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: WhisperX: We present a system for efficient speech transcription of long-form audio with word-level time alignment. The input audio is first segmented with Voice Activity Detection and then cut & merged into approximately 30-second input chunks with boundaries that lie on minimally active speech regions. The resulting chunks are then: (i) transcribed in parallel with Whisper, and (ii) force-aligned with a phoneme recognition model to produce accurate word-level timestamps at high throughput.
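
The Cut & Merge step described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the WhisperX implementation: the function names (`cut_segment`, `merge_segments`, `cut_and_merge`), the per-frame `scores` input, and the choice to search for a cut point only in the middle half of an over-long segment are all assumptions made for illustration. The sketch only shows the general idea: segments longer than the chunk limit are cut where voice activity is lowest, and short neighbouring segments are merged greedily while their combined span stays within the limit.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def cut_segment(seg: Segment, scores: Sequence[float],
                frame_s: float, max_len: float) -> List[Segment]:
    """Recursively split a segment longer than max_len at its least-active frame.

    `scores` holds one hypothetical VAD activity score per frame of length
    `frame_s`, indexed from the start of the audio.
    """
    start, end = seg
    if end - start <= max_len:
        return [seg]
    # Assumption: search only the middle half so neither piece is a sliver.
    quarter = (end - start) / 4
    lo = int((start + quarter) / frame_s)
    hi = int((end - quarter) / frame_s)
    if hi <= lo:
        # Degenerate case (very coarse frames): fall back to the midpoint.
        cut_t = (start + end) / 2
    else:
        cut_idx = min(range(lo, hi), key=lambda i: scores[i])
        cut_t = cut_idx * frame_s
    return (cut_segment((start, cut_t), scores, frame_s, max_len)
            + cut_segment((cut_t, end), scores, frame_s, max_len))


def merge_segments(segs: Sequence[Segment], max_len: float) -> List[Segment]:
    """Greedily merge neighbours while the merged span stays within max_len."""
    merged: List[Segment] = []
    for start, end in segs:
        if merged and end - merged[-1][0] <= max_len:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def cut_and_merge(segs: Sequence[Segment], scores: Sequence[float],
                  frame_s: float, max_len: float = 30.0) -> List[Segment]:
    """Cut over-long VAD segments, then merge short ones into ~max_len chunks."""
    cut: List[Segment] = []
    for seg in segs:
        cut.extend(cut_segment(seg, scores, frame_s, max_len))
    return merge_segments(cut, max_len)


# Toy example: 100 one-second frames, with a dip in activity at t = 50 s.
scores = [1.0] * 100
scores[50] = 0.1
vad_segments = [(0.0, 10.0), (12.0, 20.0), (25.0, 70.0)]
print(cut_and_merge(vad_segments, scores, frame_s=1.0))
# [(0.0, 20.0), (25.0, 50.0), (50.0, 70.0)]
```

In the toy example, the two short segments merge into one chunk, and the 45-second segment is cut at the low-activity frame. Every resulting chunk is at most 30 seconds long, so the chunks could be padded to a common length and transcribed as a batch.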