SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR

Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

TL;DR

This paper addresses the challenge of deploying large MoE-based end-to-end ASR models on edge devices by introducing Speaker Adaptive Mixture of LoRA Experts (SAML), which uses LoRA adapters as lightweight experts and a dynamic router for speaker-specific adaptation. SAML is integrated into the PQM framework with block-wise NF4 quantisation and a two-stage procedure (pretraining followed by speaker adaptation) to mitigate quantisation-induced degradation. On LibriSpeech and TED-LIUM 3, SAML achieves relative WER reductions of $29.1\%$ on the quantised Whisper model and $31.1\%$ on the quantised Conformer AED model, measured against the original full-precision models, at roughly a $7\times$ reduction in model size, and it outperforms single-LoRA baselines. The approach enables efficient, personalised ASR on edge devices and offers insights into MoE pruning and deployment.

Abstract

Mixture-of-experts (MoE) models have achieved excellent results in many tasks. However, conventional MoE models are often very large, making them challenging to deploy on resource-constrained edge devices. In this paper, we propose a novel speaker adaptive mixture of LoRA experts (SAML) approach, which uses low-rank adaptation (LoRA) modules as experts to reduce the number of trainable parameters in MoE. Specifically, SAML is applied to quantised and personalised end-to-end automatic speech recognition models and is combined with test-time speaker adaptation to improve the performance of heavily compressed models in speaker-specific scenarios. Experiments were performed on the LibriSpeech and TED-LIUM 3 corpora. Remarkably, with a 7x reduction in model size, relative word error rate reductions of 29.1% and 31.1% were achieved on the quantised Whisper model and the Conformer-based attention-based encoder-decoder ASR model respectively, compared to the original full-precision models.
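
To make the idea of "LoRA modules as experts" concrete, the following is a minimal pure-Python sketch of one linear layer whose frozen base weight is augmented by several low-rank experts mixed by a softmax router. It is an illustration under stated assumptions, not the authors' implementation: the class name `MixtureOfLoRAExperts`, the per-input softmax gating, the zero initialisation of `B`, and the `alpha / rank` scaling are all hypothetical choices, and the frozen weight `W` here is a random stand-in for a dequantised pretrained weight.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rows, cols, scale, rng):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class MixtureOfLoRAExperts:
    """Hypothetical layer: y = W x + scale * sum_e g_e(x) * B_e (A_e x)."""

    def __init__(self, d_in, d_out, num_experts=4, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out = d_in, d_out
        self.num_experts, self.rank = num_experts, rank
        # Frozen base weight (assumption: stands in for a dequantised weight).
        self.W = rand_matrix(d_out, d_in, 1.0 / math.sqrt(d_in), rng)
        # Trainable low-rank factors, one (A, B) pair per expert.
        self.A = [rand_matrix(rank, d_in, 0.01, rng) for _ in range(num_experts)]
        # B starts at zero so the layer initially reproduces the base output.
        self.B = [[[0.0] * rank for _ in range(d_out)] for _ in range(num_experts)]
        # Trainable router: one score per expert (assumed softmax gating).
        self.router = rand_matrix(num_experts, d_in, 0.01, rng)
        self.scale = alpha / rank

    def gate(self, x):
        """Mixing weights over experts for input vector x."""
        return softmax(matvec(self.router, x))

    def forward(self, x):
        y = matvec(self.W, x)
        for g, A, B in zip(self.gate(x), self.A, self.B):
            delta = matvec(B, matvec(A, x))  # low-rank update B (A x)
            y = [yi + self.scale * g * di for yi, di in zip(y, delta)]
        return y

    def num_trainable(self):
        """Parameters in the experts and router (the base weight is frozen)."""
        experts = self.num_experts * self.rank * (self.d_in + self.d_out)
        return experts + self.num_experts * self.d_in


if __name__ == "__main__":
    layer = MixtureOfLoRAExperts(d_in=8, d_out=6)
    x = [0.1 * i for i in range(8)]
    print("gate weights:", layer.gate(x))
    print("output:", layer.forward(x))
    print("trainable parameters:", layer.num_trainable())
```

In this sketch only `A`, `B` and `router` would be updated during adaptation, which is why the trainable parameter count scales with the rank rather than with the full weight matrix once the layer dimensions are large.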

Paper Structure

This paper contains 15 sections, 5 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of SAML integrated into the PQM framework.
  • Figure 2: The SAML architecture. Each attention layer is replaced with the SAML layer, and a LoRA module is added to each feed-forward layer.
  • Figure 3: t-SNE visualisation of Whisper-SAML and Whisper-LoRA encoder outputs, with different colours for each speaker.