Deep CLAS: Deep Contextual Listen, Attend and Spell

Mengzhi Wang; Shifu Xiong; Genshun Wan; Hang Chen; Jianqing Gao; Lirong Dai

TL;DR

Deep CLAS addresses the insufficient use of contextual information in Contextual-LAS (CLAS) by adding a bias loss and enriching the bias-attention query, so that the bias attention score is more accurate during decoding. It replaces phrase-level bias encoding with fine-grained character-level encoding, uses a conformer rather than an LSTM to encode the contextual information, and applies the bias attention score directly to correct the output probability distribution. Experiments use AISHELL-1 and AISHELL-NER; on AISHELL-1, Deep CLAS reports a 65.78% relative increase in recall and a 53.49% relative increase in F1-score over CLAS baselines in the named entity recognition scenario. The results point to character-level bias encoding, a richer query, and explicit probability correction as a practical route to better rare-word recognition in Mandarin ASR.

Abstract

Contextual-LAS (CLAS) has been shown to be effective in improving Automatic Speech Recognition (ASR) of rare words. It relies on phrase-level contextual modeling and attention-based relevance scoring without an explicit contextual constraint, which leads to insufficient use of contextual information. In this work, we propose deep CLAS to make better use of contextual information. We introduce a bias loss that forces the model to focus on contextual information. The query of the bias attention is also enriched to improve the accuracy of the bias attention score. To obtain fine-grained contextual information, we replace phrase-level encoding with character-level encoding and encode contextual information with a conformer rather than an LSTM. Moreover, we directly use the bias attention score to correct the output probability distribution of the model. Experiments are conducted on the public AISHELL-1 and AISHELL-NER datasets. On AISHELL-1, compared to CLAS baselines, deep CLAS obtains a 65.78% relative recall increase and a 53.49% relative F1-score increase in the named entity recognition scenario.
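
The abstract does not give the formulas for the bias loss or for the probability correction, so the following is only a minimal sketch of one way these two ideas could fit together. It assumes a pointer-style interpolation in which the bias attention has one extra "no bias" slot, and a cross-entropy bias loss on the attention weights. The function names (`fuse_bias_attention`, `bias_loss`), the "no bias" slot, and the toy numbers are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_bias_attention(model_probs, bias_attn, bias_char_ids):
    """Pointer-style fusion (an assumption, not the paper's stated formula).

    model_probs   : decoder distribution over the vocabulary (sums to 1)
    bias_attn     : attention weights over [no_bias] + bias characters
                    (sums to 1); index 0 is the assumed "no bias" slot
    bias_char_ids : vocabulary id of each bias character, aligned with
                    bias_attn[1:]
    """
    no_bias_weight = bias_attn[0]
    # Scale the decoder distribution by the mass assigned to "no bias".
    fused = [no_bias_weight * p for p in model_probs]
    # Add the remaining attention mass to the attended characters' entries.
    for weight, char_id in zip(bias_attn[1:], bias_char_ids):
        fused[char_id] += weight
    return fused


def bias_loss(bias_attn, target_index):
    """Cross-entropy on the bias attention (assumed form of a bias loss).

    target_index is the slot the attention should select at this step:
    0 for "no bias", otherwise the position of the correct bias character.
    """
    return -math.log(bias_attn[target_index])


# Toy example: a 4-symbol vocabulary and two bias characters (ids 3 and 1).
model_probs = softmax([2.0, 1.0, 0.5, 0.1])
bias_attn = softmax([0.0, 3.0, 0.5])  # slots: no_bias, char id 3, char id 1
fused = fuse_bias_attention(model_probs, bias_attn, bias_char_ids=[3, 1])
```

In this toy setup the attention concentrates on the first bias character, so the fused distribution shifts probability mass toward vocabulary id 3 while still summing to one. Under the same assumptions, the bias loss is small when the attention selects the intended slot and large otherwise, which is the sense in which such a loss would push the model to attend to the contextual information.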
