GLAD: Global-Local Aware Dynamic Mixture-of-Experts for Multi-Talker ASR

Yujie Guo, Jiaming Zhou, Yuhang Jia, Shiwan Zhao, Yong Qin

Abstract

End-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech. A critical bottleneck is that speaker-specific acoustic characteristics, which are essential for distinguishing overlapping speech, are often diluted in deep network layers. To address this, we propose the Global-Local Aware Dynamic Mixture-of-Experts (GLAD) architecture. GLAD introduces a novel routing mechanism that dynamically fuses speaker-aware global context with fine-grained local acoustic details to adaptively guide expert selection. Experiments on the LibriSpeechMix and CH109 datasets demonstrate that GLAD significantly outperforms existing Serialized Output Training (SOT)-based MTASR approaches, exhibiting exceptional robustness in challenging, high-overlap scenarios. To the best of our knowledge, this is the first work to apply a global-local fusion MoE strategy to MTASR.

Paper Structure

This paper contains 18 sections, 7 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of the proposed GLAD-SOT architecture. (a) A global linear encoder transforms features from the convolution frontend into a shared global representation, which is broadcast to each MoLE layer. (b) Each MoLE layer derives global weights from the shared global representation and integrates them with local signals to coordinate low-rank experts. (c) The global-local aware dynamic fusion module adaptively fuses weights to guide expert selection.
  • Figure 2: Visualization of the average global fusion weight ($\beta_g$) at different encoder layers across varying overlap scenarios.
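
The routing described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the equations of the paper are not reproduced on this page, so the convex-combination fusion, the sigmoid gate over concatenated features, the residual connection, and every name below (`init_params`, `glad_route`, `mole_layer`, `W_local`, `W_global`, `W_gate`, `A`, `B`) are assumptions chosen for illustration. Only the overall idea (global weights from a shared global representation, local weights from frame-level features, a global fusion weight $\beta_g$, and low-rank experts) comes from the captions.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(d, num_experts, rank, rng):
    # Hypothetical parameter set for one MoLE layer (names are illustrative).
    return {
        "W_local": 0.1 * rng.normal(size=(d, num_experts)),   # local router
        "W_global": 0.1 * rng.normal(size=(d, num_experts)),  # global router
        "W_gate": 0.1 * rng.normal(size=(2 * d, 1)),          # fusion gate
        "A": 0.1 * rng.normal(size=(num_experts, d, rank)),   # low-rank down-projections
        "B": 0.1 * rng.normal(size=(num_experts, rank, d)),   # low-rank up-projections
    }


def glad_route(x_local, g_global, params):
    """Fuse global and local routing weights (assumed form).

    x_local:  (T, d) frame-level features entering the layer.
    g_global: (T, d) shared global representation broadcast to the layer.
    Returns fused expert weights (T, E) and the global fusion weight (T, 1).
    """
    w_local = softmax(x_local @ params["W_local"])
    w_global = softmax(g_global @ params["W_global"])
    # Assumption: beta_g is a per-frame scalar gate in (0, 1).
    gate_in = np.concatenate([x_local, g_global], axis=-1)
    beta_g = sigmoid(gate_in @ params["W_gate"])
    # Assumption: fusion is a convex combination of the two weight vectors.
    w = beta_g * w_global + (1.0 - beta_g) * w_local
    return w, beta_g


def mole_layer(x_local, g_global, params):
    """Apply a mixture of low-rank experts weighted by the fused router."""
    w, beta_g = glad_route(x_local, g_global, params)
    h = np.einsum("td,edr->ter", x_local, params["A"])  # (T, E, r)
    y = np.einsum("ter,erd->ted", h, params["B"])       # (T, E, d)
    # Assumption: expert outputs are added residually to the input.
    out = x_local + np.einsum("te,ted->td", w, y)
    return out, w, beta_g
```

Because both routers emit softmax weights and the gate lies strictly between 0 and 1, the fused weights in this sketch still sum to one per frame. Averaging `beta_g` over frames in each layer would give a quantity comparable in spirit to the per-layer average that Figure 2 visualizes, under the assumptions stated above.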