Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter

Andrei Andrusenko, Aleksandr Laptev, Vladimir Bataev, Vitaly Lavrukhin, Boris Ginsburg

TL;DR

This work addresses biasing ASR toward rare and new words without modifying the model or relying on beam-search decoding. It introduces the CTC-based Word Spotter (CTC-WS), which matches CTC log-probabilities against a Trie-CTC context graph to detect biasing candidates and merges them with the greedy decoding result. A Hybrid Transducer-CTC model extends the method to Transducer models, so fast context-biasing applies to both CTC and Transducer ASR. The results show faster decoding together with improved F-score and WER compared with shallow-fusion baselines, including robust handling of abbreviations and compound words. The method is implemented in NVIDIA NeMo, and streaming extensions are noted as future work.

Abstract

Accurate recognition of rare and new words remains a pressing problem for contextualized Automatic Speech Recognition (ASR) systems. Most context-biasing methods involve modification of the ASR model or the beam-search decoding algorithm, complicating model reuse and slowing down inference. This work presents a new approach to fast context-biasing with CTC-based Word Spotter (CTC-WS) for CTC and Transducer (RNN-T) ASR models. The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates. The valid candidates then replace their greedy recognition counterparts in corresponding frame intervals. A Hybrid Transducer-CTC model enables the CTC-WS application for the Transducer model. The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER compared to baseline methods. The proposed method is publicly available in the NVIDIA NeMo toolkit.
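
The replacement step described in the abstract can be illustrated with a minimal Python sketch. It is not the NeMo implementation: the `Span` type, the `merge_with_greedy` function, the frame numbers, and the scores are all hypothetical, and the sketch assumes that candidates have already been detected and validated, so it only shows how accepted candidates could replace greedy words in overlapping frame intervals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A word with its frame interval (both ends inclusive). Hypothetical type."""
    text: str
    start: int
    end: int
    score: float = 0.0


def overlaps(a: Span, b: Span) -> bool:
    """True if the two frame intervals share at least one frame."""
    return a.start <= b.end and b.start <= a.end


def merge_with_greedy(greedy_words, candidates):
    """Replace greedy words with spotted candidates in overlapping intervals.

    Assumption: when two candidates overlap, the higher-scoring one wins.
    """
    accepted = []
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if not any(overlaps(cand, kept) for kept in accepted):
            accepted.append(cand)
    # Greedy words that overlap an accepted candidate are dropped.
    survivors = [
        w for w in greedy_words
        if not any(overlaps(w, c) for c in accepted)
    ]
    merged = sorted(survivors + accepted, key=lambda s: s.start)
    return " ".join(s.text for s in merged)


# Made-up greedy output in which "gpu" was misrecognized as "g p you".
greedy = [
    Span("the", 0, 3), Span("g", 5, 7), Span("p", 8, 9),
    Span("you", 10, 14), Span("is", 16, 18), Span("fast", 20, 25),
]
# Made-up word-spotter candidates with log-probability-like scores.
spotted = [Span("gpu", 5, 14, -1.2), Span("geforce", 4, 12, -6.0)]

print(merge_with_greedy(greedy, spotted))  # the gpu is fast
```

Keeping the merge outside the decoder is what lets the greedy pass stay unchanged: the word spotter only contributes intervals and labels, and the merge is a simple interval operation.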

Paper Structure

This paper contains 16 sections, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: The proposed context-biasing method.
  • Figure 2: A context-biasing example for a CTC model.
  • Figure 3: Context graph -- a composition of a prefix tree with CTC transition topology generated for words "gpu" and "geforce". Blue and green arcs denote blank ($\varnothing$) transitions and self-loops for non-blank tokens, respectively.
  • Figure 4: Precision, Recall, and WER depending on context-biasing weight parameter for the CTC model with CTC-WS and fixed $ctc_{w}=0.5$ for the GTC test set.
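
The context graph of Figure 3 can be approximated with a small Python sketch. It rests on several assumptions: tokens are single characters (a real model would likely use subword tokens), the names `Node`, `build_context_trie`, and `match_frames` are hypothetical, and matching uses hard per-frame labels rather than CTC log-probabilities. The goal is only to show how blank transitions and self-loops let a frame sequence traverse a prefix tree built for "gpu" and "geforce".

```python
BLANK = "<blank>"  # stand-in for the CTC blank token


class Node:
    """One prefix-tree state; `word` is set on states that end a context word."""

    def __init__(self, token=None):
        self.token = token
        self.children = {}
        self.word = None


def build_context_trie(words):
    """Build a prefix tree in which words share states for common prefixes."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node(ch))
        node.word = word
    return root


def match_frames(root, frame_tokens):
    """Walk the tree with CTC-style transitions over per-frame labels.

    Returns the matched context word, or None if the frames leave the graph
    or stop in a state that does not end a word.
    """
    node = root
    prev = BLANK
    for tok in frame_tokens:
        if tok == BLANK:
            pass  # blank transition: stay in the current state
        elif tok == prev:
            pass  # self-loop: the same non-blank token repeated on adjacent frames
        elif tok in node.children:
            node = node.children[tok]  # advance along a token arc
        else:
            return None
        prev = tok
    return node.word


trie = build_context_trie(["gpu", "geforce"])
print(match_frames(trie, ["g", "g", BLANK, "p", "u", "u"]))  # gpu
print(match_frames(trie, ["g", "e", BLANK, "f", "o", "r", "c", "c", "e"]))  # geforce
print(match_frames(trie, ["g", "x"]))  # None
```

Because both words start with "g", they share the first state, which keeps the graph compact as the context list grows.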