Zipper-LoRA: Dynamic Parameter Decoupling for Speech-LLM based Multilingual Speech Recognition

Yuxiang Mei, Delai Qiu, Shengping Liu, Jiaen Liang, Yanhua Long

Abstract

Speech Large Language Models (Speech-LLMs) have emerged as a powerful approach for automatic speech recognition (ASR) by aligning speech encoders with large language models. However, adapting these systems to multilingual settings with imbalanced data distributions remains challenging. In such scenarios, a stability-plasticity dilemma often arises: fully shared Parameter-Efficient Fine-Tuning (PEFT) can cause negative inter-lingual interference for under-represented languages, while fully language-specific tuning limits the beneficial cross-lingual knowledge transfer that low-resource languages need. To address this, we propose Zipper-LoRA, a novel rank-level decoupling framework with three variants (Static, Hard, and Soft) that synthesizes LoRA updates from shared and language-specific subspaces. A lightweight language-conditioned router dynamically controls the contribution of each subspace at the LoRA rank level, enabling fine-grained sharing where languages are compatible and strict decoupling where conflicts occur. To further stabilize optimization under imbalanced data, we propose a two-stage training strategy with an Initial-B warm start that significantly accelerates convergence. Experiments in a 12-language mixed-resource setting show that Zipper-LoRA consistently outperforms both fully shared and independent baselines, particularly in extremely low-resource scenarios. Moreover, we demonstrate that these gains are robust across both chunked and non-chunked encoder configurations, confirming the framework's reliability for practical, large-scale multilingual ASR. Our code and data will be available at https://github.com/YuCeong-May/Zipper-LoRA for reproducibility.
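
The rank-level mixing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the gating form (a sigmoid over a linear map of the language embedding), the convex combination of the two B matrices, and all names (`soft_gate`, `merge_b`, `w_router`) are assumptions made for illustration only. The abstract states only that a router controls the contribution of shared and language-specific subspaces per LoRA rank.

```python
import math

def soft_gate(lang_emb, w_router):
    """Hypothetical router: one sigmoid gate per LoRA rank.

    w_router holds one weight vector per rank; each gate is
    sigmoid(dot(lang_emb, weights)). This form is an assumption.
    """
    return [
        1.0 / (1.0 + math.exp(-sum(e * w for e, w in zip(lang_emb, weights))))
        for weights in w_router
    ]

def merge_b(b_shared, b_lang, gate):
    """Mix two B matrices column by column (one column per rank).

    gate[k] = 1 keeps the shared column k, gate[k] = 0 keeps the
    language-specific column k, values in between blend the two.
    """
    return [
        [g * s + (1.0 - g) * p for s, p, g in zip(row_s, row_l, gate)]
        for row_s, row_l in zip(b_shared, b_lang)
    ]

def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

# Toy shapes: d_out = 2, rank r = 2, d_in = 3.
b_shared = [[1.0, 2.0], [3.0, 4.0]]
b_lang = [[10.0, 20.0], [30.0, 40.0]]
a = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

# A binary gate: rank 0 is shared, rank 1 is language-specific.
binary_gate = [1.0, 0.0]
b_merged = merge_b(b_shared, b_lang, binary_gate)
delta_w = matmul(b_merged, a)  # low-rank weight update, shape (d_out, d_in)

# A soft gate computed from a made-up language embedding.
gate = soft_gate([0.5, -1.0], [[0.0, 0.0], [2.0, 1.0]])
```

With a binary gate each rank is assigned entirely to one subspace, while fractional gates blend the two. How the Static, Hard, and Soft variants actually set these weights is not specified in the abstract.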

Paper Structure (27 sections, 18 equations, 8 figures, 6 tables)

This paper contains 27 sections, 18 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Overall Speech-LLM backbone consisting of a speech encoder, a modality projector, and a decoder-only LLM.
  • Figure 2: Language-specific prompts. All prompts have the same meaning, "Please transcribe the audio content into text.", but each is written in the language specified for the input speech.
  • Figure 3: Illustration of three representative PEFT frameworks for multilingual ASR Speech-LLM adaptation: Vanilla-LoRA (a), Independent-LoRA (b), and FlyLoRA (c).
  • Figure 4: Overview of the proposed Zipper-LoRA. A language-aware router outputs rank-wise mixing weights from language embeddings to construct $B_{\text{merged}}^{(l)}$ for multilingual ASR adaptation.
  • Figure 5: Performance comparison of Zipper-LoRA-Soft (+ initial-B) and other LoRA-based methods on the 12 target languages in the SFT stage under the non-chunked setting, using $(1-\mathrm{WER/CER})\%$ as the metric. Detailed numerical results are provided in Tables \ref{tab:wer_high_nonchunk} and \ref{tab:wer_low_cv}.
  • ...and 3 more figures
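
The $(1-\mathrm{WER/CER})\%$ metric named in the Figure 5 caption can be computed as in the short sketch below. It uses the standard Levenshtein-based definition of word and character error rate; the function names are illustrative, and any text normalization or scoring tool used in the paper is not known from this page.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(ref_tokens, hyp_tokens):
    """Edit distance divided by the reference length (WER or CER)."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

def accuracy_percent(ref_tokens, hyp_tokens):
    """The (1 - error rate) value expressed as a percentage."""
    return (1.0 - error_rate(ref_tokens, hyp_tokens)) * 100.0

# Word-level scoring (WER): split on whitespace.
word_acc = accuracy_percent("a b c d".split(), "a x c".split())

# Character-level scoring (CER): score individual characters.
char_acc = accuracy_percent(list("abcd"), list("abed"))
```

The same function serves both metrics; only the tokenization differs, with words used for WER and characters for CER.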