Adopting Whisper for Confidence Estimation
Vaibhav Aggarwal, Shabari S Nair, Yash Verma, Yash Jogi
TL;DR
This paper tackles word-level confidence estimation for ASR by replacing Confidence Estimation Modules (CEMs), which depend on hand-crafted features, with an end-to-end approach that fine-tunes the Whisper decoder to emit per-token confidences. The confidence of token $t_i$ is $c(t_i) = \sigma(W_c^{T} h_i + B_c)$ with $h_i = D(e, t_{<i})$, and the confidence of a word is taken from the last token of that word. The resulting end-to-end, model-independent estimator is named C-Whisper. Experiments on Common Voice and several out-of-domain datasets show that C-Whisper-tiny matches or surpasses the CEM baseline in most settings, while C-Whisper-large achieves the best performance across all metrics and in some cases even outperforms a commercial ASR service, indicating strong cross-domain generalization. The results suggest that building confidence estimation on a pre-trained ASR foundation model yields substantial gains, with practical value for improving transcript reliability and for active-learning and quality-control workflows.
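The confidence computation described above can be illustrated with a minimal sketch. It assumes the hidden states $h_i$ have already been computed by the decoder and are passed in as plain lists of floats; the function names (`token_confidences`, `word_confidences`), the `word_ids` token-to-word mapping, and the toy values are hypothetical and chosen for illustration only, not taken from the paper's implementation.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def token_confidences(
    hidden_states: Sequence[Sequence[float]],
    w_c: Sequence[float],
    b_c: float,
) -> List[float]:
    """Compute c(t_i) = sigmoid(w_c . h_i + b_c) for every token.

    hidden_states[i] stands in for h_i (assumed precomputed by the decoder).
    """
    return [
        sigmoid(sum(w * h for w, h in zip(w_c, h_i)) + b_c)
        for h_i in hidden_states
    ]


def word_confidences(
    token_conf: Sequence[float],
    word_ids: Sequence[int],
) -> List[float]:
    """Take each word's confidence from the last token of that word.

    word_ids[i] is the index of the word that token i belongs to
    (a hypothetical mapping; how tokens are aligned to words is assumed).
    """
    last_token_conf = {}
    for conf, word in zip(token_conf, word_ids):
        last_token_conf[word] = conf  # later tokens overwrite earlier ones
    return [last_token_conf[word] for word in sorted(last_token_conf)]


# Toy example: three tokens, the first two forming one word.
hidden = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
w_c, b_c = [1.0, -1.0], 0.0
tok_conf = token_confidences(hidden, w_c, b_c)
wrd_conf = word_confidences(tok_conf, word_ids=[0, 0, 1])
```

In this toy case the first word's confidence comes from its second token, and the confidence of its first token is ignored, which mirrors the last-token rule stated in the text.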
Abstract
Recent research on word-level confidence estimation for speech recognition systems has primarily focused on lightweight models known as Confidence Estimation Modules (CEMs), which rely on hand-engineered features derived from Automatic Speech Recognition (ASR) outputs. In contrast, we propose a novel end-to-end approach that leverages the ASR model itself (Whisper) to generate word-level confidence scores. Specifically, we introduce a method in which the Whisper model is fine-tuned to produce scalar confidence scores given an audio input and its corresponding hypothesis transcript. Our experiments demonstrate that the fine-tuned Whisper-tiny model, comparable in size to a strong CEM baseline, achieves similar performance on the in-domain dataset and surpasses the CEM baseline on eight out-of-domain datasets, whereas the fine-tuned Whisper-large model consistently outperforms the CEM baseline by a substantial margin across all datasets.
