Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition
Christian Huber, Alexander Waibel
TL;DR
This work tackles pronunciation–orthography mismatches in end-to-end ASR by combining context biasing with corrections that users can add on the fly. A context list Z biases decoding through a context-attention mechanism, and at inference time a bias-phrase boosted beam search selectively boosts the next token of the relevant bias entry. Evaluated on Earnings-21 and LibriSpeech with rare words, the context biasing + replacement approach achieves up to an 8% relative reduction in biased word error rate (BWER) while maintaining a competitive overall WER; the gains depend on the distractor level and the boosting parameters. The method lets users correct errors during inference and improves recognition in challenging pronunciation–orthography cases.
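The idea of boosting the next token of a bias entry can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `boost_next_tokens`, the additive log-probability bonus `boost`, and the suffix-matching rule used to decide which entry is "relevant" are all assumptions made for illustration. In particular, the sketch only boosts continuations of entries that the hypothesis has already started, and it ignores the context-attention mechanism entirely.

```python
from typing import Dict, List, Sequence


def boost_next_tokens(
    log_probs: Dict[str, float],
    prefix: Sequence[str],
    bias_entries: List[Sequence[str]],
    boost: float = 2.0,
) -> Dict[str, float]:
    """Return a copy of log_probs with a bonus on tokens that continue a bias entry.

    Hypothetical helper: an entry counts as "relevant" when the end of the
    current hypothesis (prefix) matches the first k >= 1 tokens of the entry.
    """
    to_boost = set()
    for entry in bias_entries:
        # Longest partial match first; k stops before the entry is complete.
        max_k = min(len(prefix), len(entry) - 1)
        for k in range(max_k, 0, -1):
            if tuple(prefix[-k:]) == tuple(entry[:k]):
                to_boost.add(entry[k])  # the token that would extend the match
                break
    # Each token is boosted at most once, even if several entries agree on it.
    return {
        tok: (lp + boost if tok in to_boost else lp)
        for tok, lp in log_probs.items()
    }


# Usage with made-up subword tokens and scores.
scores = {"bel": -3.0, "ble": -1.0}
boosted = boost_next_tokens(scores, prefix=["_Wai"], bias_entries=[["_Wai", "bel"]])
```

In a beam search, a function like this would be applied to each hypothesis before the top candidates are selected, so that a partially matched bias phrase is more likely to survive pruning.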
Abstract
Neural sequence-to-sequence systems deliver state-of-the-art performance for automatic speech recognition. When using appropriate modeling units, e.g., byte-pair encoded characters, these systems are in principle open vocabulary systems. In practice, however, they often fail to recognize words not seen during training, e.g., named entities, acronyms, or domain-specific special words. To address this problem, many context biasing methods have been proposed; however, for words with a pronunciation-orthography mismatch, these methods may still struggle. We propose a method which allows corrections of substitution errors to improve the recognition accuracy of such challenging words. Users can add corrections on the fly during inference. We show that with this method we get a relative improvement in biased word error rate of up to 8%, while maintaining a competitive overall word error rate.
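For readers unfamiliar with how a relative improvement is computed, the arithmetic can be sketched as follows. The function name `relative_reduction` and the error rates in the usage line are hypothetical and chosen only to illustrate the formula; they are not results from the paper.

```python
def relative_reduction(baseline: float, system: float) -> float:
    """Relative reduction of an error rate: (baseline - system) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - system) / baseline


# Hypothetical numbers: a BWER that drops from 25.0% to 23.0%
# is a 2.0-point absolute reduction and an 8% relative reduction.
rel = relative_reduction(25.0, 23.0)
```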
