LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR

Zheshu Song, Jianheng Zhuo, Yifan Yang, Ziyang Ma, Shixiong Zhang, Xie Chen

TL;DR

The paper tackles language interference and catastrophic forgetting in multilingual ASR during language expansion. It introduces LoRA-Whisper, which attaches language-specific LoRA modules to the Whisper backbone, keeping shared knowledge in Whisper while storing language-specific information in LoRA. It also proposes LoRA warm start and LoRA MoE to leverage cross-language similarity for efficient and effective expansion to new languages. On MLS and FLEURS across eight languages, LoRA-Whisper achieves substantial gains over baselines with only a small fraction of trainable parameters, demonstrating a practical, parameter-efficient path to customizable multilingual ASR. The work has implications for building scalable, extensible speech systems based on foundation models.
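The idea of keeping shared knowledge in a frozen backbone and language-specific knowledge in LoRA can be illustrated with a small NumPy sketch. This is not the authors' implementation: the class name `LanguageLoRALinear`, the rank, the `alpha` scaling, and the initialization are illustrative assumptions following the usual LoRA formulation, and a single random matrix stands in for one Whisper projection.

```python
import numpy as np


class LanguageLoRALinear:
    """Frozen linear layer plus one low-rank (LoRA) update per language.

    Hypothetical sketch: W stands in for a single frozen Whisper projection.
    """

    def __init__(self, d_in, d_out, languages, rank=4, alpha=8.0, seed=0):
        self._rng = np.random.default_rng(seed)
        # Shared weight, never updated after construction.
        self.W = self._rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        self.rank = rank
        self.scale = alpha / rank  # common LoRA scaling convention (assumed)
        self.adapters = {}
        for lang in languages:
            self.add_language(lang)

    def add_language(self, lang):
        """Register a new (A, B) pair without touching W or other adapters."""
        d_out, d_in = self.W.shape
        A = self._rng.standard_normal((self.rank, d_in)) * 0.01
        B = np.zeros((d_out, self.rank))  # zero init: adapter starts as a no-op
        self.adapters[lang] = (A, B)

    def forward(self, x, lang):
        """Compute x W^T + scale * x A^T B^T using the adapter for `lang`."""
        A, B = self.adapters[lang]
        return x @ self.W.T + self.scale * (x @ A.T) @ B.T

    def trainable_fraction(self, lang):
        """Share of parameters that would be trained for one language."""
        A, B = self.adapters[lang]
        return (A.size + B.size) / (self.W.size + A.size + B.size)


layer = LanguageLoRALinear(d_in=8, d_out=6, languages=["en", "fr"], rank=2)
x = np.ones((3, 8))
y_en = layer.forward(x, "en")  # only the "en" adapter is consulted
```

Because each language owns its own `(A, B)` pair, updating one adapter cannot change the output for another language, which is the property the summary relies on to avoid interference and forgetting.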

Abstract

Recent years have witnessed significant progress in multilingual automatic speech recognition (ASR), driven by the emergence of end-to-end (E2E) models and the scaling of multilingual datasets. Despite this progress, two main challenges persist in multilingual ASR: language interference and the incorporation of new languages without degrading performance on existing ones. This paper proposes LoRA-Whisper, which incorporates LoRA matrices into Whisper for multilingual ASR, effectively mitigating language interference. Furthermore, by leveraging LoRA and the similarities between languages, we can achieve better performance on new languages while maintaining consistent performance on the original ones. Experiments on a real-world task across eight languages demonstrate that our proposed LoRA-Whisper yields relative gains of 18.5% and 23.0% over the baseline system for multilingual ASR and language expansion, respectively.

Paper Structure

This paper contains 16 sections, 4 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Architecture of LoRA-Whisper in multilingual ASR
  • Figure 2: Architecture of LoRA-Whisper in language expansion. Left: LoRA warm start, Right: LoRA MoE
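
The two expansion strategies named in Figure 2 could be sketched as follows. This excerpt does not give the paper's exact procedure, so the details here are assumptions: `warm_start` simply copies the LoRA factors of an already-trained language chosen by the caller, and `lora_moe_delta` mixes the low-rank updates of several existing adapters with softmax gate weights. The function names, the gating form, and the `scale` parameter are hypothetical.

```python
import numpy as np


def warm_start(adapters, source_lang):
    """Return copies of an existing language's LoRA factors (A, B).

    The copies serve as the initial values for a new language's adapter,
    so further training does not modify the source language's adapter.
    """
    A, B = adapters[source_lang]
    return A.copy(), B.copy()


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def lora_moe_delta(x, experts, gate_logits, scale=1.0):
    """Gate-weighted sum of the low-rank updates from several LoRA experts.

    x:           (n, d_in) input features
    experts:     list of (A, B) with A of shape (r, d_in), B of shape (d_out, r)
    gate_logits: one unnormalized score per expert (assumed to come from a router)
    """
    weights = softmax(np.asarray(gate_logits, dtype=float))
    d_out = experts[0][1].shape[0]
    delta = np.zeros((x.shape[0], d_out))
    for w, (A, B) in zip(weights, experts):
        delta += w * scale * (x @ A.T) @ B.T
    return delta


rng = np.random.default_rng(0)
adapters = {
    "es": (rng.standard_normal((2, 8)), rng.standard_normal((6, 2))),
    "it": (rng.standard_normal((2, 8)), rng.standard_normal((6, 2))),
}
new_A, new_B = warm_start(adapters, "es")  # starting point for a new language
x = rng.standard_normal((3, 8))
mixed = lora_moe_delta(x, [adapters["es"], adapters["it"]], gate_logits=[0.0, 0.0])
```

In both cases the adapters of the original languages are read but never written, which is consistent with the goal of adding a language without degrading existing ones.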