CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition
He Wang, Xucheng Wan, Naijun Zheng, Kai Liu, Huan Zhou, Guojian Li, Lei Xie
TL;DR
Code-switching ASR remains challenging due to language confusion and scarce labeled data. CAMEL introduces a cross-attention enhanced MoE encoder with language adapters (EN-Adapter and CN-Adapter) and a gated cross-attention fusion to jointly model language-specific and cross-lingual speech representations, plus a language diarization (LD) decoder that biases text embeddings via cross-attention. The training objective combines a language-wise CTC loss $L_{\text{lang-ctc}}$ with standard CTC and CE losses, augmented by an LD-guided $L_{\text{ld-ce}}$. Empirically, CAMEL achieves state-of-the-art results on SEAME, ASRU200, and ASRU700+LibriSpeech460 Mandarin-English code-switching corpora, significantly outperforming baselines and prior SOTA methods. The work demonstrates that explicit cross-lingual contextual modeling and language bias integration substantially improve recognition accuracy in multilingual ASR and can be extended to more languages.
Abstract
Code-switching automatic speech recognition (ASR) aims to accurately transcribe speech that contains two or more languages. To better capture language-specific speech representations and address language confusion in code-switching ASR, the mixture-of-experts (MoE) architecture and an additional language diarization (LD) decoder are commonly employed. However, most studies still rely on simple operations such as weighted summation or concatenation to fuse language-specific speech representations, leaving considerable room to improve how language bias information is integrated. In this paper, we introduce CAMEL, a cross-attention-based MoE and language bias approach for code-switching ASR. Specifically, after each MoE layer, we fuse language-specific speech representations with cross-attention, leveraging its strong contextual modeling abilities. Additionally, we design a source attention-based mechanism to incorporate the language information from the LD decoder output into text embeddings. Experimental results demonstrate that our approach achieves state-of-the-art performance on the SEAME, ASRU200, and ASRU700+LibriSpeech460 Mandarin-English code-switching ASR datasets.
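
To make the fusion idea concrete, the following is a minimal NumPy sketch of fusing two language-specific representations with cross-attention followed by a frame-wise gate. It is an illustration under stated assumptions, not the CAMEL implementation: the single-head attention without learned projections, the residual connections, the softmax gate, and all names (`cross_attention`, `fuse`, `gate_w`, `h_en`, `h_cn`) are hypothetical choices made for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, memory):
    """Single-head scaled dot-product attention (assumption: no learned projections).

    query:  (T, d) frames of one language-specific branch
    memory: (T, d) frames of the other branch, used as keys and values
    """
    d = query.shape[-1]
    scores = query @ memory.T / np.sqrt(d)   # (T, T) frame-to-frame similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ memory                  # (T, d) cross-lingual context


def fuse(h_en, h_cn, gate_w):
    """Fuse two branch outputs; the gating form is a hypothetical choice."""
    # Each branch attends to the other and keeps a residual path.
    en_out = h_en + cross_attention(h_en, h_cn)
    cn_out = h_cn + cross_attention(h_cn, h_en)
    # Frame-wise gate over the two branches, computed from both outputs.
    gate_logits = np.concatenate([en_out, cn_out], axis=-1) @ gate_w  # (T, 2)
    gate = softmax(gate_logits, axis=-1)
    fused = gate[:, :1] * en_out + gate[:, 1:] * cn_out               # (T, d)
    return fused, gate


# Toy inputs standing in for the EN and CN branch outputs of one MoE layer.
rng = np.random.default_rng(0)
T, d = 5, 8
h_en = rng.standard_normal((T, d))
h_cn = rng.standard_normal((T, d))
gate_w = 0.1 * rng.standard_normal((2 * d, 2))

fused, gate = fuse(h_en, h_cn, gate_w)
print(fused.shape)  # (5, 8)
```

In contrast to a plain weighted sum of `h_en` and `h_cn`, each frame here can draw on context from every frame of the other branch before the two are combined, which is the property the abstract attributes to cross-attention.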
