Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter
Andrei Andrusenko, Aleksandr Laptev, Vladimir Bataev, Vitaly Lavrukhin, Boris Ginsburg
TL;DR
This work addresses biasing ASR toward rare and new words without costly model changes or beam-search decoding. It introduces the CTC-based Word Spotter (CTC-WS), which uses a Trie-CTC context graph to detect biasing candidates from CTC log-probabilities and merges them with the greedy decoding result. A Hybrid Transducer-CTC framework extends the method to Transducer models, enabling fast context-biasing for both CTC and Transducer ASR. Results show significant decoding speedups together with improved F-score and WER compared with shallow-fusion baselines, including robust handling of abbreviations and compound words. The approach is implemented in NVIDIA NeMo, offers a practical, scalable solution for real-world contextual ASR, and points to streaming extensions as future work.
Abstract
Accurate recognition of rare and new words remains a pressing problem for contextualized Automatic Speech Recognition (ASR) systems. Most context-biasing methods involve modification of the ASR model or the beam-search decoding algorithm, complicating model reuse and slowing down inference. This work presents a new approach to fast context-biasing with CTC-based Word Spotter (CTC-WS) for CTC and Transducer (RNN-T) ASR models. The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates. The valid candidates then replace their greedy recognition counterparts in corresponding frame intervals. A Hybrid Transducer-CTC model enables the CTC-WS application for the Transducer model. The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER compared to baseline methods. The proposed method is publicly available in the NVIDIA NeMo toolkit.
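The replacement step described above, in which valid candidates replace their greedy counterparts in the corresponding frame intervals, can be illustrated with a minimal sketch. This is not the NeMo implementation. The `Span` type, the `merge_candidates` function, the inclusive frame-interval convention, and the rule of resolving overlapping candidates by best score are all assumptions made for illustration. The sketch also assumes that word-level frame intervals are already available for both the greedy output and the spotted candidates.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A word with its frame interval (hypothetical helper type)."""
    text: str
    start: int          # first frame, inclusive
    end: int            # last frame, inclusive
    score: float = 0.0  # log-probability score; only used for candidates


def overlaps(a: Span, b: Span) -> bool:
    """True if the two inclusive frame intervals share at least one frame."""
    return not (a.end < b.start or a.start > b.end)


def merge_candidates(greedy: list[Span], candidates: list[Span]) -> str:
    """Replace greedy words with spotted candidates in overlapping intervals.

    Assumption: when two candidates overlap, the higher-scoring one wins.
    """
    # Select non-overlapping candidates, best score first.
    chosen: list[Span] = []
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if not any(overlaps(cand, c) for c in chosen):
            chosen.append(cand)

    # Drop every greedy word that overlaps a chosen candidate.
    kept = [w for w in greedy if not any(overlaps(w, c) for c in chosen)]

    # Restore time order and join into the final transcript.
    merged = sorted(kept + chosen, key=lambda s: s.start)
    return " ".join(s.text for s in merged)


# Usage: the greedy result misrecognizes a rare word as "in video".
greedy = [Span("use", 0, 4), Span("in", 6, 8),
          Span("video", 9, 15), Span("nemo", 17, 22)]
spotted = [Span("nvidia", 6, 15, score=-1.2)]
print(merge_candidates(greedy, spotted))  # use nvidia nemo
```

Because the merge operates only on the greedy output and a list of scored intervals, a step of this kind needs no beam search and no change to the acoustic model.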
