
XCB: an effective contextual biasing approach to bias cross-lingual phrases in speech recognition

Xucheng Wan, Naijun Zheng, Kai Liu, Huan Zhou

TL;DR

This work tackles cross-lingual biasing in code-switching ASR by introducing Cross-lingual Contextual Biasing (XCB), a lightweight module inserted between the SeACo-Paraformer encoder and predictor, together with a language-specific loss $L_{CE}^{2nd}$ that guides learning of secondary-language ($L_{2nd}$) tokens. XCB comprises a Language Biasing Adapter and a Biasing Merging Gate, which yield biased acoustic embeddings via $E_{lb} = \mathrm{BMGate}(H, \mathrm{LBAdapter}(H))$ while preserving dominant-language performance. The total training loss is $L_{total} = L_{ASR} + L_{bias} + \alpha L_{CE}^{2nd}$. Empirically, XCB substantially improves biasing metrics such as $BWER$ and $BMER$ on in-house Mandarin-English data and generalizes to the unseen ASRU-2019 test set without additional fine-tuning. The results indicate a practical, inference-efficient path to robust code-switching ASR with hotword biasing, though further analysis is needed to understand why an inactive biasing configuration can outperform an active one in some settings.
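
The two formulas in the TL;DR can be illustrated with a minimal NumPy sketch. The summary does not specify the internals of the LB Adapter or the BM Gate, so the layer shapes below are assumptions made purely for illustration: the adapter is taken to be a single linear layer with a tanh, and the gate a sigmoid-weighted blend of $H$ and the adapter output. The names `lb_adapter`, `bm_gate`, `total_loss`, and all weight matrices are hypothetical and are not taken from the paper.

```python
import numpy as np

def lb_adapter(H, W, b):
    # ASSUMPTION: adapter modelled as one linear layer followed by tanh.
    return np.tanh(H @ W + b)

def bm_gate(H, A, Wg, bg):
    # ASSUMPTION: gate modelled as a per-element sigmoid computed from [H; A],
    # used to blend the adapter output A with the original embeddings H.
    logits = np.concatenate([H, A], axis=-1) @ Wg + bg
    g = 1.0 / (1.0 + np.exp(-logits))
    return g * A + (1.0 - g) * H

def total_loss(l_asr, l_bias, l_ce_2nd, alpha):
    # L_total = L_ASR + L_bias + alpha * L_CE^2nd, as stated in the TL;DR.
    return l_asr + l_bias + alpha * l_ce_2nd

# Toy dimensions (hypothetical): T frames, D-dimensional acoustic embeddings.
rng = np.random.default_rng(0)
T, D = 5, 8
H = rng.standard_normal((T, D))
W = rng.standard_normal((D, D)) * 0.1
b = np.zeros(D)
Wg = rng.standard_normal((2 * D, D)) * 0.1
bg = np.zeros(D)

A = lb_adapter(H, W, b)
E_lb = bm_gate(H, A, Wg, bg)  # E_lb = BMGate(H, LBAdapter(H))
```

Under these assumptions `E_lb` has the same shape as `H`, so it could be passed on in place of the original embeddings, and each of its elements lies between the corresponding elements of `H` and the adapter output.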

Abstract

Contextualized ASR models have been shown to improve the recognition accuracy of uncommon phrases when a predefined phrase list is available. However, these models often struggle in bilingual settings, which are prevalent in code-switching speech recognition. In this study, we make an initial attempt to address this challenge by introducing a Cross-lingual Contextual Biasing (XCB) module. Specifically, we augment a pre-trained ASR model for the dominant language by integrating an auxiliary language biasing module and a supplementary language-specific loss, aimed at enhancing the recognition of phrases in the secondary language. Experimental results on our in-house code-switching dataset validate the efficacy of our approach, demonstrating significant improvements in the recognition of biasing phrases in the secondary language, even without any additional inference overhead. Additionally, our proposed system exhibits both efficiency and generalization when applied to the unseen ASRU-2019 test set.

Paper Structure

This paper contains 13 sections, 3 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Illustration of the proposed XCB-enhancement on the SeACo-Paraformer: (a) the overall architecture; (b) detailed structure of LB Adapter and BM Gate components.