Automatic Speech Recognition System-Independent Word Error Rate Estimation
Chanho Park, Mingjie Chen, Thomas Hain
TL;DR
This work introduces System-Independent WER Estimation (SIWE), a framework for estimating word error rate without training on the output of any specific ASR system. SIWE generates hypotheses directly from reference transcripts, using data augmentation to simulate ASR-like errors, and trains a regression model on them, with a two-tower Fe-WER-inspired estimator as a baseline. On in-domain data SIWE performs comparably to ASR system-dependent WER estimators, while on out-of-domain data it achieves state-of-the-art performance, improving on the baselines by a relative 17.58% in RMSE and 18.21% in PCC on Switchboard and CALLHOME. Performance improves further when the WER of the training set is close to that of the evaluation set. The results highlight the importance of phonetic-substitution and linguistic-probability cues, and show that matching the training WER range to the evaluation data benefits SIWE in cross-domain scenarios.
Abstract
Word error rate (WER) is a metric used to evaluate the quality of transcriptions produced by Automatic Speech Recognition (ASR) systems. In many applications, it is of interest to estimate WER given a speech utterance and its transcript. Previous work on WER estimation focused on building models that are trained with a specific ASR system in mind (referred to as ASR system-dependent). These are also domain-dependent and inflexible in real-world applications. In this paper, a hypothesis generation method for ASR System-Independent WER estimation (SIWE) is proposed. In contrast to prior work, the WER estimators are trained using data that simulates ASR system output. Hypotheses are generated using phonetically similar or linguistically more likely alternative words. In WER estimation experiments, the proposed method performs comparably to ASR system-dependent WER estimators on in-domain data and achieves state-of-the-art performance on out-of-domain data. On the out-of-domain data, the SIWE model outperformed the baseline estimators in root mean square error and Pearson correlation coefficient by 17.58% and 18.21% relative, respectively, on Switchboard and CALLHOME. The performance was further improved when the WER of the training set was close to the WER of the evaluation dataset.
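To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It computes WER as the word-level edit distance divided by the reference length, and it builds a simulated hypothesis by swapping reference words for similar-sounding alternatives until a target WER is reached. The `SIMILAR_WORDS` table, the function names, and the `target_wer` parameter are all assumptions made for illustration; the abstract does not specify how alternative words are selected.

```python
import random


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical toy table of phonetically similar words (illustration only).
SIMILAR_WORDS = {
    "there": ["their"],
    "to": ["two"],
    "see": ["sea"],
    "right": ["write"],
}


def generate_hypothesis(reference: str, target_wer: float, rng: random.Random) -> str:
    """Replace reference words with similar-sounding ones to approach target_wer."""
    words = reference.split()
    candidates = [i for i, word in enumerate(words) if word in SIMILAR_WORDS]
    # The number of substitutions is capped by how many words have an alternative.
    n_errors = min(len(candidates), round(target_wer * len(words)))
    for i in rng.sample(candidates, n_errors):
        words[i] = rng.choice(SIMILAR_WORDS[words[i]])
    return " ".join(words)


if __name__ == "__main__":
    rng = random.Random(0)
    reference = "we want to see the house over there right now"
    hypothesis = generate_hypothesis(reference, target_wer=0.2, rng=rng)
    print(hypothesis)
    print(word_error_rate(reference, hypothesis))
```

Pairs of generated hypotheses and their computed WER values could then serve as regression targets for an estimator. Controlling `target_wer` is one simple way to shape the training WER distribution, which the abstract reports as mattering for out-of-domain performance.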
