Dynamic Language Group-Based MoE: Enhancing Code-Switching Speech Recognition with Hierarchical Routing

Hukai Huang; Shenghui Lu; Yahui Shan; He Qu; Fengrun Zhang; Wenhao Guan; Qingyang Hong; Lin Li

TL;DR

This work introduces Dynamic Language Group-based MoE (DLG-MoE) for code-switching speech recognition. It uses hierarchical routing to model language separately from within-language attributes such as accents and domains. A Shared Language Router (SLR) performs frame-level language routing using a multi-task CTC-based objective that also guides ASR, while an UnSup-Router within each language group performs fine-grained routing to a subset of experts, enabling scalable MoE capacity. The model supports dynamic top-$k$ inference and streaming, and can be pruned to a monolingual sub-model; it is trained end-to-end with a combined loss that integrates CTC, attention, and inter-task terms. Empirical results on ASRU-2019 Mandarin-English CS-ASR and LibriSpeech show substantial improvements over prior MoE approaches. High LID accuracy indicates strong language-specific routing, and performance is robust across monolingual and code-switching scenarios, including efficient streaming and parameter-pruned settings.

Abstract

The Mixture of Experts (MoE) model is a promising approach for code-switching speech recognition (CS-ASR). However, existing MoE-based CS-ASR work has yet to fully exploit MoE's parameter-scaling ability. This work proposes DLG-MoE, a Dynamic Language Group-based MoE that handles the CS-ASR task effectively while benefiting from parameter scaling. DLG-MoE is built on a hierarchical routing mechanism. First, the language router explicitly models the language attribute and dispatches representations to the corresponding language expert groups. Then, the unsupervised router within each language group implicitly models attributes beyond language and coordinates expert routing and collaboration. DLG-MoE outperforms existing MoE methods on CS-ASR tasks while remaining flexible: it supports different top-$k$ values at inference, supports streaming, and allows model parameters to be pruned to obtain a monolingual sub-model. The code has been released.
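
The two-level routing described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' released code: the class names (`LanguageGroup`, `HierarchicalMoE`), the method names (`route_frame`, `prune_to`), the random weights, the linear "experts", and the hard argmax language decision are all assumptions made for illustration. In the paper the language router is trained with a CTC-based objective, which is not modelled here.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class LanguageGroup:
    """Hypothetical language group: a router plus a set of linear experts."""

    def __init__(self, rng, dim, num_experts):
        self.router = rand_matrix(rng, num_experts, dim)
        self.experts = [rand_matrix(rng, dim, dim) for _ in range(num_experts)]

    def forward(self, x, top_k):
        # Second level: score all experts in this group, keep the top-k.
        probs = softmax(matvec(self.router, x))
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        top = order[:top_k]
        norm = sum(probs[i] for i in top)
        out = [0.0] * len(x)
        for i in top:
            weight = probs[i] / norm  # renormalise over the selected experts
            y = matvec(self.experts[i], x)
            out = [o + weight * v for o, v in zip(out, y)]
        return out, top


class HierarchicalMoE:
    """Hypothetical two-level router: language first, then experts in that group."""

    def __init__(self, dim, languages, experts_per_group, seed=0):
        rng = random.Random(seed)
        self.languages = list(languages)
        # Assumption: random weights stand in for a trained language router.
        self.language_router = rand_matrix(rng, len(self.languages), dim)
        self.groups = {
            lang: LanguageGroup(rng, dim, experts_per_group)
            for lang in self.languages
        }

    def route_frame(self, x, top_k=1):
        # First level: pick one language group per frame (hard argmax, assumed).
        scores = matvec(self.language_router, x)
        best = max(range(len(scores)), key=lambda i: scores[i])
        lang = self.languages[best]
        out, experts = self.groups[lang].forward(x, top_k)
        return lang, experts, out

    def prune_to(self, lang):
        # Keep a single language group to obtain a monolingual sub-model.
        idx = self.languages.index(lang)
        pruned = HierarchicalMoE.__new__(HierarchicalMoE)
        pruned.languages = [lang]
        pruned.language_router = [self.language_router[idx]]
        pruned.groups = {lang: self.groups[lang]}
        return pruned


model = HierarchicalMoE(dim=4, languages=["zh", "en"], experts_per_group=4)
frame = [0.3, -0.1, 0.8, 0.5]
lang, experts, out = model.route_frame(frame, top_k=2)
en_only = model.prune_to("en")
```

Because `top_k` is an argument of `route_frame` rather than a fixed property of the model, the same weights can be evaluated with different numbers of active experts, which mirrors the flexible top-$k$ inference the abstract mentions. Pruning simply drops the other language groups, so the remaining sub-model can only route to the retained language.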

Paper Structure (14 sections, 4 equations, 2 figures, 3 tables)

Introduction
Proposed Method
Model Architecture
Shared Language Router
Dynamic Language Groups
Experiments Setup
Datasets
Model Configurations
Evaluation Metrics
Experiments Results
Analysis of Results for Different MoE Structures
Analysis of Expert Capacities and Top-k strategy
Streaming Capability and Flexibility
Conclusions and future work

Figures (2)

Figure 1: (a) Overview of the proposed method; SLR denotes the Shared Language Router. (b) DLG-MoE layer. (c) Dynamic Language Groups: the SLR performs language routing, and the UnSup-Router then routes further within each language group.
Figure 2: Visualisation of SLR's frame-level routing results, where red represents Chinese and green represents English.
