SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR

Qiuming Zhao; Guangzhi Sun; Chao Zhang; Mingxing Xu; Thomas Fang Zheng

TL;DR

This paper addresses the challenge of deploying large MoE-based end-to-end ASR models on edge devices by introducing Speaker Adaptive Mixture of LoRA Experts (SAML), which uses LoRA adapters as lightweight experts and a dynamic router for speaker-specific adaptation. SAML is integrated into the PQM framework with block-wise NF4 quantisation and a two-stage procedure, pretraining followed by speaker adaptation, to mitigate quantisation-induced degradation. In experiments on LibriSpeech and TED-LIUM 3, SAML achieves relative word error rate (WER) reductions of $29.1\%$ on the quantised Whisper model and $31.1\%$ on the quantised Conformer-based AED model, compared with the original full-precision models, at a $7\times$ reduction in model size; it also outperforms single-LoRA baselines. The approach enables efficient, personalised ASR on edge devices and offers insights into MoE pruning and deployment.
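To make the idea of block-wise quantisation concrete, the following is a minimal sketch, not the paper's implementation. It assumes a uniform 16-level codebook in $[-1, 1]$ as a stand-in for the non-uniform NF4 codebook, and the function names `quantise_blockwise` and `dequantise_blockwise` as well as the block size of 64 are hypothetical choices for illustration. Each block of weights is scaled by its own absolute maximum and every value is mapped to the index of the nearest codebook level, so only 4-bit indices plus one scale per block need to be stored.

```python
import numpy as np


def quantise_blockwise(weights, block_size=64, levels=None):
    """Quantise a float array block by block to 4-bit codebook indices."""
    if levels is None:
        # Assumption: 16 evenly spaced levels; NF4 uses a different, non-uniform codebook.
        levels = np.linspace(-1.0, 1.0, 16)
    flat = np.asarray(weights, dtype=np.float32).ravel()
    pad = (-flat.size) % block_size              # zero-pad to a whole number of blocks
    blocks = np.pad(flat, (0, pad)).reshape(-1, block_size)
    # One scale per block: the largest absolute value in that block.
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0                    # avoid division by zero for all-zero blocks
    normalised = blocks / scales                 # now every value lies in [-1, 1]
    # Index of the nearest codebook level for every weight.
    idx = np.abs(normalised[..., None] - levels).argmin(axis=-1).astype(np.uint8)
    return idx, scales, levels, flat.size


def dequantise_blockwise(idx, scales, levels, size):
    """Map indices back to floats and drop the padding."""
    return (levels[idx] * scales).ravel()[:size]
```

Because each block carries its own scale, a single outlier weight only coarsens the resolution of its own block rather than that of the whole tensor.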

Abstract

Mixture-of-experts (MoE) models have achieved excellent results in many tasks. However, conventional MoE models are often very large, making them challenging to deploy on resource-constrained edge devices. In this paper, we propose a novel speaker adaptive mixture of LoRA experts (SAML) approach, which uses low-rank adaptation (LoRA) modules as experts to reduce the number of trainable parameters in MoE. Specifically, SAML is applied to quantised and personalised end-to-end automatic speech recognition (ASR) models, and is combined with test-time speaker adaptation to improve the performance of heavily compressed models in speaker-specific scenarios. Experiments were performed on the LibriSpeech and TED-LIUM 3 corpora. Remarkably, with a 7x reduction in model size, relative word error rate reductions of 29.1% and 31.1% were achieved on the quantised Whisper model and the Conformer-based attention-based encoder-decoder ASR model respectively, compared with the original full-precision models.
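As an illustration of what "LoRA modules as experts" can look like, the sketch below mixes several low-rank updates on top of a frozen linear weight. It is a simplified reading of the idea rather than the authors' architecture: the class name `MixtureOfLoRAExperts`, the per-frame softmax router, the dense (non-sparse) mixing of all experts, and the default hyperparameters are all assumptions made for this example. The LoRA part follows the standard formulation, in which the update is $\frac{\alpha}{r} BA$ and the up-projection starts at zero.

```python
import numpy as np


def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class MixtureOfLoRAExperts:
    """Frozen linear layer plus a router-weighted sum of LoRA experts (illustrative)."""

    def __init__(self, d_in, d_out, num_experts=4, rank=8, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.02, (d_in, d_out))               # frozen base weight
        self.A = rng.normal(0, 0.02, (num_experts, d_in, rank))   # per-expert down-projection
        self.B = np.zeros((num_experts, rank, d_out))             # per-expert up-projection, zero init
        self.W_router = rng.normal(0, 0.02, (d_in, num_experts))  # assumed linear router
        self.scale = alpha / rank

    def __call__(self, x):
        # x has shape (T, d_in): T frames of d_in-dimensional features.
        gates = softmax(x @ self.W_router)                        # (T, E), rows sum to 1
        low = np.einsum("td,edr->etr", x, self.A)                 # project down to rank r
        delta = np.einsum("etr,ero->eto", low, self.B)            # project back up per expert
        mixed = np.einsum("te,eto->to", gates, delta)             # router-weighted combination
        return x @ self.W + self.scale * mixed, gates
```

In this sketch only `A`, `B` and `W_router` would be trained, which is where the parameter saving over full-size experts comes from: each expert adds `rank * (d_in + d_out)` parameters instead of `d_in * d_out`.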

Paper Structure (15 sections, 5 equations, 3 figures, 4 tables)

Introduction
Related work
Mixture-of-Experts
Speaker adaptation
Methodology
Preliminaries
Mixture-of-Experts
Low-Rank Adaptation
Speaker Adaptive Mixture of LoRA Experts
SAML integrated into PQM framework
Experimental setup
Data
Model and training specifications
Evaluation results and analysis
Conclusions

Figures (3)

Figure 1: Overview of the SAML integrated into PQM framework.
Figure 2: The SAML architecture. Each attention layer is replaced with the SAML layer, and a LoRA module is added to each feed-forward layer.
Figure 3: t-SNE visualisation of Whisper-SAML and Whisper-LoRA encoder outputs, with different colours for each speaker.
