Speaker Adaptation for Quantised End-to-End ASR Models

Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

TL;DR

This work tackles the challenge of deploying large end-to-end ASR models on edge devices by introducing P4Q, a three-stage strategy that combines block-wise NF4 quantisation, LoRA-based speaker adaptation with pretraining, and test-time speaker adaptation. The approach preserves accuracy while substantially reducing model size, which makes edge deployment practical. Empirical results on Whisper and Conformer AED models show sizable relative WER improvements (up to ~25%) over quantised baselines on LibriSpeech and TED-LIUM 3, supporting the use of speaker adaptation on quantised models. Overall, P4Q demonstrates a viable path to personalised, compact, and high-performance end-to-end ASR for resource-constrained devices.

Abstract

End-to-end models have shown superior performance for automatic speech recognition (ASR). However, such models are often very large and thus challenging to deploy on resource-constrained edge devices. While quantisation can reduce model size, it can also increase word error rates (WERs). Although improved quantisation methods have been proposed to address this performance degradation, the fact that quantised models deployed on edge devices often target only a small group of users remains under-explored. To this end, we propose personalisation for quantised models (P4Q), a novel strategy that uses speaker adaptation (SA) to improve quantised end-to-end ASR models by fitting them to the characteristics of the target speakers. In this paper, we study the P4Q strategy on Whisper and Conformer attention-based encoder-decoder (AED) end-to-end ASR models, using a 4-bit block-wise NormalFloat4 (NF4) approach for quantisation and the low-rank adaptation (LoRA) approach for SA. Experimental results on the LibriSpeech and TED-LIUM 3 corpora show that, with a 7-fold reduction in model size and 1% extra speaker-specific parameters, relative WER reductions of 15.1% and 23.3% were achieved on quantised Whisper and Conformer AED models respectively, compared to the full-precision models.
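
To make the two ingredients named in the abstract concrete, the following is a minimal pure-Python sketch of block-wise 4-bit quantisation against an NF4-style codebook and of a LoRA-augmented linear layer. It is an illustration under stated assumptions, not the authors' implementation: the function names, the block size of 64, the 32-bit per-block scale, and the rounded codebook values are all assumptions introduced here, and the size accounting in `compression_ratio` is only one plausible way to arrive at a figure near the reported 7-fold reduction.

```python
# Approximate NF4 levels (rounded to 4 decimals); assumed here for illustration.
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantise_blockwise(weights, block_size=64):
    """Return (codes, scales): one 4-bit code per weight, one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Absmax scaling maps the block into [-1, 1]; fall back to 1.0 for all-zero blocks.
        scale = max(abs(w) for w in block) or 1.0
        scales.append(scale)
        for w in block:
            x = w / scale
            # Nearest codebook level; its index (0..15) is the stored 4-bit code.
            codes.append(min(range(16), key=lambda i: abs(NF4_LEVELS[i] - x)))
    return codes, scales


def dequantise_blockwise(codes, scales, block_size=64):
    """Map each code back to its level and rescale by the scale of its block."""
    return [NF4_LEVELS[c] * scales[i // block_size] for i, c in enumerate(codes)]


def lora_linear(x, w_rows, lora_a, lora_b, scaling=1.0):
    """Compute y = W x + scaling * B (A x).

    W (w_rows, out x in) stands for the frozen, dequantised weight matrix;
    A (r x in) and B (out x r) are the small speaker-specific matrices.
    """
    base = [sum(w * xi for w, xi in zip(row, x)) for row in w_rows]
    ax = [sum(a * xi for a, xi in zip(row, x)) for row in lora_a]
    bax = [sum(b * h for b, h in zip(row, ax)) for row in lora_b]
    return [y + scaling * d for y, d in zip(base, bax)]


def compression_ratio(block_size=64, scale_bits=32, full_bits=32):
    """Illustrative accounting: 4 bits per weight plus one scale shared per block."""
    return full_bits / (4 + scale_bits / block_size)
```

In this sketch only `lora_a` and `lora_b` would be trained per speaker, while the quantised codes and scales stay fixed. Initialising `lora_b` to zeros leaves the quantised model's output unchanged before adaptation begins.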

Paper Structure

This paper contains 4 sections, 1 equation, 2 tables.