Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation
Zhennan Lin, Kaixun Huang, Wei Ren, Linju Yang, Lei Xie
TL;DR
The paper tackles the recognition of rare contextual phrases in end-to-end ASR by moving from subword-level biasing to encoder-based phrase-level biasing with dynamic vocabulary tokens. It introduces a context encoder, a bias-aware module, and a dynamic output layer that expands the vocabulary to include contextual phrases; the model is trained with a bias loss and decoded with a confidence-activated step. Compared to the baseline, the approach achieves relative WER reductions of 28.31% on LibriSpeech and 23.49% on WenetSpeech, and relative WER reductions on contextual phrases of 72.04% and 75.69%, respectively. Key contributions include the dynamic vocabulary design, two context-labeling strategies (WR and TA) with a preference for TA, and a decoding mechanism that uses CTC posterior confidence to replace bias tokens with full phrases while suppressing incorrect bias. Overall, the method provides effective contextual biasing in both English and Chinese, which makes it relevant to practical ASR deployments.
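The dynamic vocabulary and the confidence-activated replacement described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the blank index, the threshold value, and the fallback rule for low-confidence bias tokens (use the best static token on that frame) are all assumptions made for illustration. The sketch appends one phrase-level token per contextual phrase to a static subword vocabulary, then runs a greedy CTC decode in which a phrase token is accepted only when its frame-level posterior reaches the threshold.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank token


def build_dynamic_vocab(static_vocab: Sequence[str],
                        phrases: Sequence[str]) -> List[str]:
    """Append one phrase-level token per contextual phrase (hypothetical helper)."""
    return list(static_vocab) + list(phrases)


def confidence_activated_decode(posteriors: Sequence[Sequence[float]],
                                static_size: int,
                                vocab: Sequence[str],
                                threshold: float = 0.6) -> List[str]:
    """Greedy CTC decode that keeps a phrase token only if it is confident enough.

    The threshold and the fallback rule are illustrative assumptions.
    """
    ids = []
    for frame in posteriors:
        best = max(range(len(frame)), key=frame.__getitem__)
        if best >= static_size and frame[best] < threshold:
            # Low-confidence bias token: fall back to the best static token.
            best = max(range(static_size), key=frame.__getitem__)
        ids.append(best)

    # Standard CTC collapse: merge consecutive repeats, then drop blanks.
    output, prev = [], None
    for i in ids:
        if i != prev and i != BLANK:
            output.append(vocab[i])
        prev = i
    return output


static_vocab = ["<blank>", "call", "the", "doc"]
vocab = build_dynamic_vocab(static_vocab, ["acme corp"])  # phrase token id 4
posteriors = [
    [0.10, 0.80, 0.05, 0.03, 0.02],  # "call"
    [0.05, 0.05, 0.05, 0.05, 0.80],  # confident phrase token
    [0.05, 0.05, 0.05, 0.05, 0.80],  # repeated frame, merged by CTC collapse
    [0.90, 0.03, 0.03, 0.02, 0.02],  # blank
    [0.30, 0.05, 0.10, 0.15, 0.40],  # weak phrase token, suppressed at 0.6
]
decoded = confidence_activated_decode(posteriors, len(static_vocab), vocab)
print(decoded)
```

With the default threshold the weak phrase token on the last frame is suppressed, so the phrase is emitted once as a whole unit; lowering the threshold would let that frame emit the phrase again.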
Abstract
Deep biasing improves automatic speech recognition (ASR) performance by incorporating contextual phrases. However, most existing methods enhance the subwords in a contextual phrase as independent units, which can compromise the integrity of the phrase and reduce accuracy. In this paper, we propose an encoder-based phrase-level contextualized ASR method that leverages dynamic vocabulary prediction and activation. We introduce architectural optimizations and integrate a bias loss to extend phrase-level predictions based on frame-level outputs. We also introduce a confidence-activated decoding method that ensures the complete output of contextual phrases while suppressing incorrect bias. Experiments on the LibriSpeech and WenetSpeech datasets demonstrate that our approach achieves relative WER reductions of 28.31% and 23.49%, respectively, compared to the baseline, with the WER on contextual phrases decreasing relatively by 72.04% and 75.69%.
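For readers unfamiliar with the metric, a relative WER reduction is the drop in WER expressed as a fraction of the baseline WER, not a difference in percentage points. The short sketch below shows the arithmetic; the function name is hypothetical and the input values are made-up illustrations, not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Illustrative values only: a drop from 4.0% to 3.0% WER is one
# percentage point in absolute terms and 25% in relative terms.
example = relative_wer_reduction(4.0, 3.0)
print(example)
```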
