Adopting Whisper for Confidence Estimation

Vaibhav Aggarwal, Shabari S Nair, Yash Verma, Yash Jogi

TL;DR

This paper tackles word-level confidence estimation for ASR by replacing hand-crafted CEMs with an end-to-end approach that fine-tunes the Whisper decoder to emit per-token confidences. It defines the per-token confidence as $c(t_i) = \sigma(W_c^{T} h_i + B_c)$ with $h_i = D(e, t_{<i})$, where $D$ is the decoder, $e$ the encoder features, and $t_{<i}$ the preceding hypothesis tokens, and takes the confidence of the last token of each word as the word-level confidence. The resulting end-to-end, model-independent estimator is named C-Whisper. Experiments on Common Voice and several out-of-domain datasets show that C-Whisper-tiny matches or surpasses CEM in most settings, while C-Whisper-large achieves the best performance across all metrics and in some cases even outperforms a commercial ASR service, indicating strong cross-domain generalization. The results suggest that leveraging a pre-trained ASR foundation model for confidence estimation yields substantial gains, with practical value for improving transcript reliability and for active-learning and quality-control workflows.
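To make the per-token formula concrete, the following is a minimal sketch of a confidence head that applies $\sigma(W_c^{T} h_i + B_c)$ to a sequence of decoder hidden states. It is not the authors' implementation: the function names, the three-dimensional toy hidden states, and the weight and bias values are all hypothetical, chosen only to show the arithmetic. In a real system $h_i$ would come from the fine-tuned Whisper decoder and $W_c$, $B_c$ would be learned.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def token_confidences(
    hidden_states: List[List[float]],  # one vector h_i per hypothesis token
    w_c: List[float],                  # projection weights (hypothetical values below)
    b_c: float,                        # scalar bias (hypothetical value below)
) -> List[float]:
    """Return c(t_i) = sigmoid(w_c . h_i + b_c) for every token."""
    confidences = []
    for h_i in hidden_states:
        if len(h_i) != len(w_c):
            raise ValueError("hidden state and weight vector must have equal length")
        logit = sum(w * h for w, h in zip(w_c, h_i)) + b_c
        confidences.append(sigmoid(logit))
    return confidences


# Toy inputs standing in for decoder outputs; not real Whisper activations.
hidden = [
    [0.5, -1.0, 2.0],
    [0.0, 0.0, 0.0],
    [-3.0, 1.0, 0.5],
]
w_c = [1.0, 0.5, -0.25]
b_c = 0.1

conf = token_confidences(hidden, w_c, b_c)
for i, c in enumerate(conf):
    print(f"token {i}: confidence {c:.3f}")
```

Because the head is a single linear projection followed by a sigmoid, each score lies strictly between 0 and 1 and can be read as an estimated probability that the token is correct.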

Abstract

Recent research on word-level confidence estimation for speech recognition systems has primarily focused on lightweight models known as Confidence Estimation Modules (CEMs), which rely on hand-engineered features derived from Automatic Speech Recognition (ASR) outputs. In contrast, we propose a novel end-to-end approach that leverages the ASR model itself (Whisper) to generate word-level confidence scores. Specifically, we introduce a method in which the Whisper model is fine-tuned to produce scalar confidence scores given an audio input and its corresponding hypothesis transcript. Our experiments demonstrate that the fine-tuned Whisper-tiny model, comparable in size to a strong CEM baseline, achieves similar performance on the in-domain dataset and surpasses the CEM baseline on eight out-of-domain datasets, whereas the fine-tuned Whisper-large model consistently outperforms the CEM baseline by a substantial margin across all datasets.

Paper Structure

This paper contains 6 sections, 3 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: C-Whisper: The model's decoder takes both the hypothesis transcript and encoder features as inputs and produces token-level confidence scores. To represent the confidence for an entire word, we use the confidence score of the last token of each word. Consequently, the confidence scores for tokens that do not occur at the end of a word are displayed in gray.
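
The last-token rule described in the caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: it assumes that a token whose text begins with a space starts a new word, and the token strings and scores below are made up for the example rather than produced by the Whisper tokenizer. The function name `word_confidences` is likewise hypothetical.

```python
from typing import List, Tuple


def word_confidences(
    tokens: List[str], confs: List[float]
) -> List[Tuple[str, float]]:
    """Group tokens into words and keep the last token's confidence per word.

    Assumption: a token whose text starts with a space begins a new word.
    """
    if len(tokens) != len(confs):
        raise ValueError("tokens and confs must have the same length")

    words: List[Tuple[str, float]] = []
    current_text = ""
    last_conf = 0.0
    for tok, c in zip(tokens, confs):
        # A leading space marks a word boundary, so flush the finished word.
        if tok.startswith(" ") and current_text:
            words.append((current_text.strip(), last_conf))
            current_text = ""
        current_text += tok
        last_conf = c  # overwritten until the word's final token is reached
    if current_text:
        words.append((current_text.strip(), last_conf))
    return words


# Made-up sub-word split and scores, for illustration only.
tokens = [" conf", "idence", " est", "im", "ation"]
confs = [0.91, 0.84, 0.40, 0.55, 0.62]

result = word_confidences(tokens, confs)
print(result)
```

Here the scores attached to " conf", " est", and "im" are discarded, which corresponds to the grayed-out values in the figure; only the scores on "idence" and "ation" are reported for their words.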