Quantifying the Role of Textual Predictability in Automatic Speech Recognition

Sean Robertson, Gerald Penn, Ewan Dunbar

TL;DR

This work addresses how to attribute ASR errors to acoustics versus textual context by introducing a single-parameter metric $k$ that links context predictability to error rates via $e_c = e_i^k$ (equivalently $p_c = 1 - (1 - p_i)^k$). The authors bin utterances by language-model negative log-likelihood (NLL) into zero-, low-, and high-predictability conditions and estimate $k$ across acoustic conditions, using non-linear regression with wild bootstrap confidence intervals. They evaluate multiple ASR systems (GMM, TDNN, Wav2Vec 2.0 base and large) on the LibriSpeech and CORAAL datasets, finding that $k$ increases with more powerful/contextual models and with higher predictability bins, with W2V2-Large exhibiting the strongest dependence on textual predictability. On CORAAL data, higher $k$ values suggest increased reliance on context while supporting the conclusion that African-American English disparities primarily reflect acoustic-modelling challenges rather than textual predictability. The paper provides a practical recipe and open-source tools for diagnosing and improving ASR systems by balancing acoustic and textual modelling considerations.
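
The equivalence between the error-rate and accuracy forms quoted above can be spelled out in a short worked step. This is a sketch under one assumption: that $p$ denotes accuracy and $e = 1 - p$ the matching error rate, with subscripts $i$ and $c$ marking the isolated and in-context conditions (a reading suggested by the figure captions, not stated in the summary itself).

```latex
\begin{align}
e_c &= e_i^{k}, \qquad e = 1 - p \\
\Longleftrightarrow\quad 1 - p_c &= (1 - p_i)^{k} \\
\Longleftrightarrow\quad p_c &= 1 - (1 - p_i)^{k}
\end{align}
% For 0 < e_i < 1: k = 1 gives e_c = e_i (context changes nothing),
% while k > 1 gives e_c < e_i (context lowers the error rate).
```

Under this reading, $k = 1$ corresponds to a recognizer that gains nothing from context, and larger $k$ to a larger reduction in error once context is available.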

Abstract

A long-standing question in automatic speech recognition research is how to attribute errors to the ability of a model to model the acoustics, versus its ability to leverage higher-order context (lexicon, morphology, syntax, semantics). We validate a novel approach which models error rates as a function of relative textual predictability, and yields a single number, $k$, which measures the effect of textual predictability on the recognizer. We use this method to demonstrate that a Wav2Vec 2.0-based model makes greater use of textual context than a hybrid ASR model, in spite of not using an explicit language model, and also use it to shed light on recent results demonstrating poor performance of standard ASR systems on African-American English. We demonstrate that these mostly represent failures of acoustic–phonetic modelling. We show how this approach can be used straightforwardly in diagnosing and improving ASR.

Paper Structure

This paper contains 11 sections, 4 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Accuracy ratios across $k$ from Equation \ref{eq:k}.
  • Figure 2: In-context vs. isolated accuracies of W2V2-L. The grey, dashed line is $y=x$. Black lines mark the interpolated fits over LS-C from Table \ref{tab:k}: the shallow curve is LP; the steep curve is HP.
  • Figure 3: Point-wise estimates of $k = \ln e_c / \ln e_i$ vs. error rates $e_i$ of W2V2-L. Each point is paired by SNR and partition. Black lines mark the interpolated fits from Table \ref{tab:k}.
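
The point-wise estimate named in the Figure 3 caption, $k = \ln e_c / \ln e_i$, can be sketched in a few lines of Python. The function names and the error-rate pairs below are hypothetical and purely illustrative; they are not the authors' tools or data. The pooled fit shown here is a simple least-squares line through the origin in log space, swapped in for brevity. It is not the non-linear regression with wild bootstrap intervals that the paper reports using.

```python
import math
from typing import Iterable, Tuple


def in_context_error(e_i: float, k: float) -> float:
    """Predicted in-context error rate under e_c = e_i ** k."""
    return e_i ** k


def pointwise_k(e_i: float, e_c: float) -> float:
    """Point-wise estimate k = ln(e_c) / ln(e_i) for one pair of error rates."""
    # Both rates must lie strictly between 0 and 1 so the logs are finite
    # and the denominator is non-zero.
    if not (0.0 < e_i < 1.0 and 0.0 < e_c < 1.0):
        raise ValueError("error rates must lie strictly between 0 and 1")
    return math.log(e_c) / math.log(e_i)


def fit_k_log_linear(pairs: Iterable[Tuple[float, float]]) -> float:
    """Pooled k from a least-squares line through the origin in log space.

    Assumption: this linearised fit is a stand-in for illustration only;
    it is not the estimation procedure described in the paper.
    """
    pairs = list(pairs)
    num = sum(math.log(e_i) * math.log(e_c) for e_i, e_c in pairs)
    den = sum(math.log(e_i) ** 2 for e_i, _ in pairs)
    return num / den


# Made-up (isolated, in-context) error rates, generated with k = 1.5
# so that the fit has a known answer to recover.
example_pairs = [(e_i, in_context_error(e_i, 1.5)) for e_i in (0.2, 0.4, 0.6, 0.8)]
k_hat = fit_k_log_linear(example_pairs)
print(f"pooled k estimate: {k_hat:.3f}")
```

With real data, each pair would come from one acoustic condition (for example, one SNR level) measured once in isolation and once in context, matching the pairing described in the caption.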