On-the-fly Routing for Zero-shot MoE Speaker Adaptation of Speech Foundation Models for Dysarthric Speech Recognition

Shujie HU, Xurong Xie, Mengzhe Geng, Jiajun Deng, Huimeng Wang, Guinan Li, Chengxi Deng, Tianzi Wang, Mingyu Cui, Helen Meng, Xunying Liu

TL;DR

This work addresses dysarthric speech recognition with an on-the-fly Mixture of Experts (MoE) approach that performs zero-shot, real-time speaker adaptation of speech foundation models. It combines severity- and gender-conditioned adapter experts, a KL-divergence term that promotes expert diversity, and a routing network that predicts speaker-dependent routing parameters on the fly. On UASpeech, the method yields up to $1.34\%$ absolute WER reduction over unadapted baselines and Real-Time Factor (RTF) speedups of up to $7\times$ over batch-mode adaptation, and it reaches the lowest published WER of $16.35\%$ after cross-system rescoring. Routing parameter heatmaps make the learned routing easier to interpret, and zero-shot operation supports practical deployment with generalization to unseen speakers.

Abstract

This paper proposes a novel MoE-based speaker adaptation framework for foundation-model-based dysarthric speech recognition. The approach enables zero-shot adaptation and real-time processing while incorporating domain knowledge. Adapter experts conditioned on speech impairment severity and gender are dynamically combined using speaker-dependent routing parameters that are predicted on the fly. KL-divergence is used to further enforce diversity among the experts and to improve their generalization to unseen speakers. Experimental results on the UASpeech corpus suggest that on-the-fly MoE-based adaptation produces statistically significant WER reductions of up to 1.34% absolute (6.36% relative) over the unadapted baseline HuBERT/WavLM models. Consistent WER reductions of up to 2.55% absolute (11.44% relative) and RTF speedups of up to 7 times are obtained over batch-mode adaptation across varying quantities of speaker-level data. The lowest published WER of 16.35% (46.77% on very low intelligibility) is obtained.
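The routing idea in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the bottleneck adapter shape, mean-pooling of hidden states before routing, the softmax router, the residual mixture, and in particular the `expert_diversity` term, which is only one possible way to measure expert diversity with KL-divergence. The names `AdapterExpert`, `predict_routing`, `moe_adapt` and `expert_diversity` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - np.max(x, axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

class AdapterExpert:
    """Hypothetical bottleneck adapter: down-projection, ReLU, up-projection."""
    def __init__(self, dim, bottleneck, rng):
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        self.w_up = rng.normal(0.0, 0.02, size=(bottleneck, dim))

    def __call__(self, h):
        return np.maximum(h @ self.w_down, 0.0) @ self.w_up

def predict_routing(h, w_router):
    """Assumed router: mean-pool the frames, then softmax over experts."""
    pooled = h.mean(axis=0)            # (dim,)
    return softmax(pooled @ w_router)  # (num_experts,), sums to 1

def moe_adapt(h, experts, r):
    """Residual mixture: h + sum_k r[k] * expert_k(h)."""
    mixed = sum(r_k * expert(h) for r_k, expert in zip(r, experts))
    return h + mixed

def kl_divergence(p, q, eps=1e-8):
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def expert_diversity(h, experts):
    """Assumed diversity measure: mean pairwise KL between the
    softmax-normalised, frame-averaged outputs of the experts."""
    dists = [softmax(expert(h).mean(axis=0)) for expert in experts]
    n = len(dists)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    return sum(kl_divergence(dists[i], dists[j]) for i, j in pairs) / len(pairs)

# Toy usage with random weights and a random "utterance" of 50 frames.
rng = np.random.default_rng(0)
dim, bottleneck, num_experts, frames = 16, 4, 4, 50
experts = [AdapterExpert(dim, bottleneck, rng) for _ in range(num_experts)]
w_router = rng.normal(0.0, 0.1, size=(dim, num_experts))
h = rng.normal(size=(frames, dim))

r = predict_routing(h, w_router)    # predicted from the utterance itself
adapted = moe_adapt(h, experts, r)  # same shape as h
diversity = expert_diversity(h, experts)
```

Because `r` is predicted from the incoming utterance, no speaker-specific parameters need to be estimated beforehand, which is what makes this kind of routing zero-shot. In batch mode, by contrast, `r` would be a stored per-speaker vector estimated from that speaker's adaptation data.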

Paper Structure

This paper contains 10 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Architecture of MoE-based adaptation on a speech foundation model (SFM), where the routing parameters are derived from either a) speaker-dependent parameters in batch mode; or b) an on-the-fly routing network. "LN", "MHSA", "DP" and "FFN" denote layer normalization, multi-head self-attention, dropout and feedforward.
  • Figure 2: Examples of on-the-fly (panels a and b) and batch-mode (panels a and c) MoE-based speaker adaptation on the SFM. Routing parameters $\bm{r}^{s}$ in a) serve as speaker-dependent (SD) parameters, while experts are shared by all speakers. The line charts in b) and c) illustrate the variation in a specific expert's routing parameters as a function of utterance count.
  • Figure 3: t-SNE visualization of the on-the-fly MoE-based, i-vector, and x-vector adaptation. The determinants of their covariance matrices are shown in each bracket.
  • Figure 4: WER and cosine similarity on HuBERT systems with varying amounts of speaker adaptation data.
  • Figure 5: Heatmap visualization of routing parameters under varying settings on HuBERT: a) batch-mode without domain knowledge, b) batch-mode with domain knowledge and c) on-the-fly with domain knowledge. "Severity Label: {Spk_ids}" is given at the top of the figure.
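
The Figure 3 caption reports the determinant of each embedding set's covariance matrix. As a rough illustration of that quantity (often called the generalized variance, a scalar summary of how spread out a point cloud is), the sketch below computes it for two synthetic 2-D point sets. The data and the helper name `generalized_variance` are made up for this example and are not taken from the paper.

```python
import numpy as np

def generalized_variance(points):
    """Determinant of the sample covariance of an (n_points, n_dims) array."""
    cov = np.cov(points, rowvar=False)  # (n_dims, n_dims)
    return float(np.linalg.det(cov))

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))  # synthetic stand-in for 2-D projections
tight = 0.5 * base                # compact cluster
spread = 2.0 * base               # same shape, four times wider

gv_tight = generalized_variance(tight)
gv_spread = generalized_variance(spread)
```

Scaling a 2-D point set by a factor `c` scales this determinant by `c**4`, so `gv_spread` is 256 times `gv_tight` here.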