Quantifying the Role of Textual Predictability in Automatic Speech Recognition
Sean Robertson, Gerald Penn, Ewan Dunbar
TL;DR
This work addresses how to attribute ASR errors to acoustics versus textual context by introducing a single-parameter metric $k$ that links context predictability to error rates via $e_c = e_i^k$ (equivalently $p_c = 1 - (1 - p_i)^k$). The authors bin utterances by LM-driven NLL into zero-, low-, and high-predictability conditions and estimate $k$ across acoustic conditions, using non-linear regression with Wild bootstrap CIs. They evaluate multiple ASR systems (GMM, TDNN, Wav2Vec 2.0 base and large) on LibriSpeech and CORAAL datasets, finding that $k$ increases with more powerful/contextual models and with higher predictability bins, with W2V2-Large exhibiting the strongest dependence on textual predictability. On CORAAL data, higher $k$ values suggest increased reliance on context while supporting the conclusion that African-American English disparities primarily reflect acoustic-modelling challenges rather than textual predictability. The paper provides a practical diagnostic recipe and open-source tools to diagnose and improve ASR systems by balancing acoustic and textual modelling considerations.
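To make the relation $e_c = e_i^k$ concrete, the following is a minimal sketch of how $k$ could be recovered from paired error rates. It reads $e_i$ as the error rate without context and $e_c$ as the error rate with context, which is an interpretation of the subscripts that the summary does not spell out. The function names and the error rates are hypothetical. The fit is a least-squares line through the origin in log-log space, a simplification swapped in for illustration; it is not the non-linear regression with Wild bootstrap CIs that the authors use.

```python
import math


def context_error(e_i: float, k: float) -> float:
    """Predicted in-context error rate under the model e_c = e_i ** k."""
    return e_i ** k


def estimate_k(e_isolated, e_context):
    """Estimate k by least squares on log(e_c) = k * log(e_i).

    Simplified stand-in for the paper's non-linear regression.
    Assumes every error rate lies strictly between 0 and 1, and that
    at least one isolated error rate differs from 1.
    """
    xs = [math.log(e) for e in e_isolated]
    ys = [math.log(e) for e in e_context]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Hypothetical error rates across four acoustic conditions (not from the paper).
e_zero = [0.9, 0.7, 0.5, 0.3]

# Synthetic in-context error rates generated with a known k = 2.
e_high = [context_error(e, 2.0) for e in e_zero]

k_hat = estimate_k(e_zero, e_high)
print(f"estimated k = {k_hat:.3f}")
```

Under this model, $k = 1$ means context does not change the error rate, and larger $k$ means the same acoustic error rate shrinks more when context is available. Real data would be noisy, so a point estimate like this one would need confidence intervals before drawing conclusions.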
Abstract
A long-standing question in automatic speech recognition research is how to attribute errors to the ability of a model to model the acoustics, versus its ability to leverage higher-order context (lexicon, morphology, syntax, semantics). We validate a novel approach which models error rates as a function of relative textual predictability, and yields a single number, $k$, which measures the effect of textual predictability on the recognizer. We use this method to demonstrate that a Wav2Vec 2.0-based model makes stronger use of textual context than a hybrid ASR model, in spite of not using an explicit language model, and also use it to shed light on recent results demonstrating poor performance of standard ASR systems on African-American English. We demonstrate that these mostly represent failures of acoustic-phonetic modelling. We show how this approach can be used straightforwardly in diagnosing and improving ASR.
