Dynamic Language Group-Based MoE: Enhancing Code-Switching Speech Recognition with Hierarchical Routing
Hukai Huang, Shenghui Lu, Yahui Shan, He Qu, Fengrun Zhang, Wenhao Guan, Qingyang Hong, Lin Li
TL;DR
This work introduces Dynamic Language Group-based MoE (DLG-MoE) for code-switching speech recognition. It uses hierarchical routing to model language separately from within-language attributes such as accent and domain. A Shared Language Router (SLR) performs frame-level language routing, trained with a multi-task CTC-based objective that also guides ASR, while an UnSup-Router within each language group performs fine-grained routing to a subset of experts, which lets MoE capacity scale. The model supports dynamic top-$k$ inference and streaming, and can be pruned to a monolingual sub-model. It is trained end-to-end with a combined loss that integrates CTC, attention, and inter-task terms. Empirical results on ASRU-2019 Mandarin-English CS-ASR and LibriSpeech show substantial improvements over prior MoE approaches. High LID accuracy indicates strong language-specific routing, and performance is robust across monolingual and code-switching scenarios, including under streaming and after parameter pruning.
Abstract
The Mixture of Experts (MoE) model is a promising approach for code-switching speech recognition (CS-ASR). However, existing MoE work on CS-ASR has yet to fully exploit MoE's parameter-scaling ability. This work proposes DLG-MoE, a Dynamic Language Group-based MoE, which handles the CS-ASR task effectively while benefiting from parameter scaling. DLG-MoE is built on a hierarchical routing mechanism. First, the language router explicitly models the language attribute and dispatches the representations to the corresponding language expert groups. Subsequently, the unsupervised router within each language group implicitly models attributes beyond language and coordinates expert routing and collaboration. DLG-MoE outperforms existing MoE methods on CS-ASR tasks while remaining flexible: it supports inference with different top-$k$ values, supports streaming, and can be pruned to obtain a monolingual sub-model. The code has been released.
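
The two-level routing described in the abstract can be illustrated with a small NumPy sketch. This is not the released implementation: the class name ToyDLGMoE, the random untrained weights, the single-matrix "experts", and the softmax over the selected top-k gate logits are all assumptions made for illustration. It shows only the control flow: a frame-level language decision first, then top-k expert selection inside the chosen language group, plus a toy version of pruning to one language.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class ToyDLGMoE:
    """Hypothetical toy layer: language router -> per-group top-k router."""

    def __init__(self, dim, n_langs=2, experts_per_group=4, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: random weights stand in for trained routers and experts.
        self.lang_router = rng.normal(size=(dim, n_langs)) * 0.1
        self.group_routers = rng.normal(size=(n_langs, dim, experts_per_group)) * 0.1
        # Assumption: each expert is a single linear map.
        self.experts = rng.normal(size=(n_langs, experts_per_group, dim, dim)) * 0.1

    def forward(self, x, top_k=2):
        """x: (frames, dim). Returns (output, per-frame language ids)."""
        n_langs, _, n_experts = self.group_routers.shape
        # Level 1: hard frame-level language decision.
        lang_ids = (x @ self.lang_router).argmax(axis=-1)
        out = np.zeros_like(x)
        for g in range(n_langs):
            idx = np.nonzero(lang_ids == g)[0]
            if idx.size == 0:
                continue
            xg = x[idx]
            # Level 2: top-k expert selection within the language group.
            logits = xg @ self.group_routers[g]
            top = np.argsort(-logits, axis=-1)[:, :top_k]
            gates = softmax(np.take_along_axis(logits, top, axis=-1))
            for slot in range(top_k):
                for e in range(n_experts):
                    mask = top[:, slot] == e
                    if mask.any():
                        y = xg[mask] @ self.experts[g, e]
                        out[idx[mask]] += gates[mask, slot][:, None] * y
        return out, lang_ids

    def keep_language(self, g):
        """Toy pruning: keep only language group g as a monolingual sub-model."""
        sub = ToyDLGMoE.__new__(ToyDLGMoE)
        sub.lang_router = self.lang_router[:, g:g + 1]
        sub.group_routers = self.group_routers[g:g + 1]
        sub.experts = self.experts[g:g + 1]
        return sub
```

Because top_k is an argument of forward rather than a fixed attribute, the same weights can be evaluated with different top-$k$ values at inference time. In this sketch, keep_language(g) yields a sub-model whose output matches the full model on frames that the full model routed to group g.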
