Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer
Peng Wang, Yifan Yang, Zheng Liang, Tian Tan, Shiliang Zhang, Xie Chen
TL;DR
The paper tackles the difficulty of recognizing named entities in end-to-end ASR, which stems from their long-tail surface forms, by proposing C-FNT, a factorized neural Transducer augmented with a class-based language model. By scoring named entities through a dedicated @name class and constraining emission to a provided name list, C-FNT preserves standard ASR performance while substantially reducing named-entity errors, with relative WER improvements of up to $7.2\%$–$7.6\%$ and relative F1 gains of $27.9\%$–$30.8\%$ on targeted NER tests. Decoding uses beam search with four status transitions to enter, traverse, and leave the name class, and a dynamic beam size mitigates path duplication. Overall, the approach offers a modular, adaptable framework for NER in E2E ASR: name lists and domain-specific entities can be updated easily while strong general recognition performance is maintained, which demonstrates the potential of integrating class-based LMs into E2E models.
Abstract
Despite advances in end-to-end (E2E) speech recognition models, named entity recognition (NER) remains challenging yet critical for semantic understanding. Previous studies have mainly focused on rule-based or attention-based contextual biasing algorithms. However, their performance can be sensitive to the biasing weight or degraded by excessive attention to the named entity list, and they carry a risk of false triggering. Inspired by the success of the class-based language model (LM) for NER in conventional hybrid systems and by the effective decoupling of acoustic and linguistic information in the factorized neural Transducer (FNT), we propose C-FNT, a novel E2E model that incorporates class-based LMs into FNT. In C-FNT, the LM score of a named entity can be associated with the name class instead of its surface form. The experimental results show that the proposed C-FNT significantly reduces errors in named entities without hurting performance in general word recognition.
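To make the idea of scoring a name through its class concrete, the following is a minimal sketch of the textbook class-based LM factorization, P(name | history) = P(@name | history) · P(name | @name). It is not taken from the paper: the function name `class_lm_logprob`, the uniform distribution over the name list, and the example probability are all illustrative assumptions, and the actual C-FNT scoring may differ.

```python
import math


def class_lm_logprob(p_class_given_history, name_list, name):
    """Log-probability of `name` under a class-based factorization.

    Assumptions (hypothetical, for illustration only):
      * `p_class_given_history` is the LM probability of the @name class
        token given the preceding words.
      * Names inside the class are uniformly distributed over `name_list`.
    """
    if name not in name_list:
        # Emission is constrained to the provided name list.
        return float("-inf")
    p_name_given_class = 1.0 / len(name_list)
    return math.log(p_class_given_history) + math.log(p_name_given_class)


# Hypothetical usage: the history makes a name likely with probability 0.2.
names = ["alice", "bob"]
score_in_list = class_lm_logprob(0.2, names, "alice")
score_out_of_list = class_lm_logprob(0.2, names, "carol")
```

Because the history-dependent term depends only on the class token, a rare name and a frequent name in the list receive the same LM score in this sketch, and swapping the name list requires no retraining of the history model.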
