Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation
Ruizhe Huang, Mahsa Yarmohammadi, Sanjeev Khudanpur, Daniel Povey
TL;DR
This work addresses the challenge of recognizing rare words and named entities in end-to-end ASR by strengthening neural contextual biasing. It introduces two techniques: (i) injecting biasing contexts into intermediate encoder layers to propagate contextual information through self-attention, and (ii) training-time text perturbation using alternative spellings to force the model to rely on the provided context. Implemented on a transducer with a Zipformer encoder and cross-attention biasing adapters, the approach achieves state-of-the-art results on LibriSpeech contextual ASR and shows solid gains on SPGISpeech and ConEC, while keeping decoding overhead practical. The findings demonstrate that simple, scalable context incorporation and perturbation-based training can substantially improve rare-word recognition in contextual ASR, with broad implications for personalized or domain-specific speech systems.
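As a rough illustration of technique (i), the sketch below runs a stack of toy encoder layers and applies a cross-attention biasing step after selected layers. Everything here is an assumption made for illustration: the names `cross_attend`, `encode`, `toy_layer`, and `inject_after` are hypothetical, the attention has no learned projections, and the toy layer is only a stand-in for a real Zipformer layer. It is not the authors' implementation.

```python
import math

def cross_attend(frames, context):
    """Residual scaled dot-product cross-attention: frames attend to context.

    Assumption: no learned query/key/value projections, to keep the sketch small.
    """
    dim = len(context[0])
    out = []
    for frame in frames:
        scores = [sum(f * c for f, c in zip(frame, ctx)) / math.sqrt(dim)
                  for ctx in context]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        mixed = [sum(w / total * ctx[d] for w, ctx in zip(weights, context))
                 for d in range(dim)]
        out.append([f + m for f, m in zip(frame, mixed)])  # residual add
    return out

def encode(frames, layers, context, inject_after):
    """Run encoder layers, applying the biasing step after the chosen layers."""
    for index, layer in enumerate(layers):
        frames = layer(frames)
        if index in inject_after:
            frames = cross_attend(frames, context)
    return frames

def toy_layer(frames):
    # Stand-in for a real encoder layer: average each frame with its left
    # neighbour, so anything injected earlier spreads across time steps.
    return [[(a + b) / 2 for a, b in zip(frames[max(t - 1, 0)], frames[t])]
            for t in range(len(frames))]

frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # toy acoustic frames
context = [[1.0, 0.0], [0.0, 1.0]]              # one embedding per biasing phrase
layers = [toy_layer] * 4

late_only = encode(frames, layers, context, inject_after={3})
early_and_late = encode(frames, layers, context, inject_after={1, 3})
```

With `inject_after={1, 3}`, the context added after layer 1 passes through the remaining layers before the final injection, which is the intuition behind early injection; `inject_after={3}` corresponds to biasing only at the last layer.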
Abstract
Existing research suggests that automatic speech recognition (ASR) models can benefit from additional contexts (e.g., contact lists, user-specified vocabulary), which help them recognize rare words and named entities. In this work, we propose two simple yet effective techniques to improve context-aware ASR models. First, we inject contexts into the encoders at an early stage instead of merely at their last layers. Second, to force the model to leverage the contexts during training, we perturb the reference transcription with alternative spellings so that the model learns to rely on the contexts to make correct predictions. On LibriSpeech, our techniques together reduce the rare word error rate by 60% and 25% relative to no biasing and shallow fusion, respectively, setting a new state of the art. On SPGISpeech and a real-world dataset, ConEC, our techniques also yield good improvements over the baselines.
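The text perturbation idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the table `ALT_SPELLINGS`, the function `perturb_transcript`, and its `prob` parameter are hypothetical, and the abstract does not say how alternative spellings are generated. The sketch also assumes that the perturbed spelling is placed in the biasing list, so the model can only predict it by consulting the context.

```python
import random

# Hypothetical table of alternative spellings for rare words.
ALT_SPELLINGS = {
    "kaitlyn": ["katelyn", "caitlin"],
    "nvidia": ["envidia", "nvidea"],
}

def perturb_transcript(words, alt_spellings, rng, prob=1.0):
    """Swap known rare words for an alternative spelling.

    Returns the perturbed training target and the biasing list that
    accompanies it. Words without an entry in the table are left unchanged.
    """
    perturbed, biasing_list = [], []
    for word in words:
        candidates = alt_spellings.get(word.lower())
        if candidates and rng.random() < prob:
            word = rng.choice(candidates)
            biasing_list.append(word)  # the context carries the new spelling
        perturbed.append(word)
    return perturbed, biasing_list

rng = random.Random(0)  # fixed seed so the example is repeatable
reference = "call kaitlyn about the nvidia earnings".split()
target, context = perturb_transcript(reference, ALT_SPELLINGS, rng)
```

Because the audio still matches the original word while the target uses an unfamiliar spelling, the acoustic evidence alone is not enough to produce the target, which pushes the model toward the supplied context.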
