Table of Contents
Fetching ...

CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions

Laurin Wagner, Bernhard Thallinger, Mario Zusag

TL;DR

CrisperWhisper tackles the lack of precise word-level timestamps in Whisper by combining a retokenized vocabulary with a DTW-based alignment over cross-attention signals to timestamp tokens. It is trained on verbatim-focused datasets (AMI, PodcastFillers, CommonVoice) with noise augmentation and disfluency emphasis, including a minority of noise-only samples to reduce hallucinations. The approach yields state-of-the-art performance in word segmentation, disfluency localization, and verbatim transcription while preserving standard ASR accuracy on mainstream benchmarks and mitigating hallucinations in sensitive datasets. The work provides a practical, end-to-end method for accurate timing and disfluency analysis, with potential for broader language and domain adaptation.

Abstract

We demonstrate that carefully adjusting the tokenizer of the Whisper speech recognition model significantly improves the precision of word-level timestamps when applying dynamic time warping to the decoder's cross-attention scores. We fine-tune the model to produce more verbatim speech transcriptions and employ several techniques to increase robustness against multiple speakers and background noise. These adjustments achieve state-of-the-art performance on benchmarks for verbatim speech transcription, word segmentation, and the timed detection of filler events, and can further mitigate transcription hallucinations. The code is available open https://github.com/nyrahealth/CrisperWhisper.

CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions

TL;DR

CrisperWhisper tackles the lack of precise word-level timestamps in Whisper by combining a retokenized vocabulary with a DTW-based alignment over cross-attention signals to timestamp tokens. It is trained on verbatim-focused datasets (AMI, PodcastFillers, CommonVoice) with noise augmentation and disfluency emphasis, including a minority of noise-only samples to reduce hallucinations. The approach yields state-of-the-art performance in word segmentation, disfluency localization, and verbatim transcription while preserving standard ASR accuracy on mainstream benchmarks and mitigating hallucinations in sensitive datasets. The work provides a practical, end-to-end method for accurate timing and disfluency analysis, with potential for broader language and domain adaptation.

Abstract

We demonstrate that carefully adjusting the tokenizer of the Whisper speech recognition model significantly improves the precision of word-level timestamps when applying dynamic time warping to the decoder's cross-attention scores. We fine-tune the model to produce more verbatim speech transcriptions and employ several techniques to increase robustness against multiple speakers and background noise. These adjustments achieve state-of-the-art performance on benchmarks for verbatim speech transcription, word segmentation, and the timed detection of filler events, and can further mitigate transcription hallucinations. The code is available open https://github.com/nyrahealth/CrisperWhisper.
Paper Structure (19 sections, 1 equation, 4 figures, 2 tables)

This paper contains 19 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Example of a DTW path through the cross-attention weights matrix of Whisper large-v2 as in lintoai2023whispertimestamped. White lines represent the ground truth.
  • Figure 2: Example of a DTW path through the cross-attention weights matrix after CrisperWhisper retokenization. White lines represent the ground truth.
  • Figure 3: Word segmentation performance showing the F1-score for different collar values.
  • Figure 4: Disfluency Localization Performance