Efficient Adapter Finetuning for Tail Languages in Streaming Multilingual ASR

Junwen Bai, Bo Li, Qiujia Li, Tara N. Sainath, Trevor Strohman

TL;DR

This work tackles streaming multilingual ASR for tail languages, where data is scarce and privacy protection can make data unavailable. It introduces Language-Dependent Adapters (LDAs) inserted into a frozen foundation Conformer backbone within a cascaded transducer, trained with Noisy Student Training, and uses a checkpoint-merging strategy to combine each language's peak-performing adapter into a single deployable model. The approach achieves an average $12.2\%$ relative WER reduction across 39 tail languages, with up to $37.5\%$ on a single locale (Slovak), and can match full-model finetuning while training only $0.4\%$ of the full model's parameters per language, which reduces training and deployment cost. Overall, LDA enables efficient, high-quality streaming multilingual ASR across many languages under data privacy constraints and variable data availability.
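To make the headline numbers concrete, the following minimal sketch computes a relative WER reduction. The function name is hypothetical, and the WER values in the usage lines are illustrative placeholders chosen to give a 37.5% reduction; they are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    """Return the fractional WER reduction relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - model_wer) / baseline_wer


# Illustrative values only (not from the paper): WER drops from 16.0 to 10.0.
reduction = relative_wer_reduction(16.0, 10.0)
print(f"{reduction:.1%}")  # prints 37.5%
```

Note that a relative reduction is measured against the baseline WER, so the same absolute gain counts for more on a locale whose baseline WER is already low.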

Abstract

The end-to-end ASR model is often desired in the streaming multilingual scenario since it is easier to deploy and can benefit from pre-trained speech models such as powerful foundation models. Meanwhile, the heterogeneous nature and imbalanced data abundance of different languages may cause performance degradation, leading to asynchronous peak performance for different languages during training, especially on tail ones. Sometimes even the data itself may become unavailable as a result of enhanced privacy protection. Existing work tends to significantly increase the model size or learn language-specific decoders to accommodate each language separately. In this study, we explore simple yet effective Language-Dependent Adapter (LDA) finetuning under a cascaded Conformer transducer framework enhanced by teacher pseudo-labeling for tail languages in streaming multilingual ASR. The adapter accounts for only 0.4% of the full model per language. It is plugged into the frozen foundation model and is the only trainable module during finetuning with noisy student training. The final model merges the adapter parameters from different checkpoints for different languages. The model performance is validated on a challenging multilingual dictation dataset, which includes 39 tail languages across Latin, Greek, Arabic, etc. Our proposed method brings a 12.2% word error rate reduction on average and up to 37.5% on a single locale. Furthermore, we show that our parameter-efficient LDA can match the quality of full model finetuning, thus greatly alleviating the asynchronous peak performance issue.
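The checkpoint-merging step mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the checkpoint layout, the key names (`step`, `adapters`, `dev_wer`), and the rule of selecting each language's adapter by lowest development-set WER are all hypothetical. The abstract states only that adapter parameters from different checkpoints are merged for different languages.

```python
from typing import Any, Dict, List, Tuple


def merge_adapters(
    checkpoints: List[Dict[str, Any]],
) -> Tuple[Dict[str, Any], Dict[str, int]]:
    """Pick, per language, the adapter from the checkpoint with the lowest dev WER.

    Each checkpoint is a dict with hypothetical keys:
      "step":     training step of the checkpoint
      "adapters": {language: adapter parameters}
      "dev_wer":  {language: dev-set WER at this step}
    """
    best: Dict[str, Tuple[float, int]] = {}  # language -> (best WER, step)
    merged: Dict[str, Any] = {}              # language -> chosen adapter
    for ckpt in checkpoints:
        for lang, wer in ckpt["dev_wer"].items():
            if lang not in best or wer < best[lang][0]:
                best[lang] = (wer, ckpt["step"])
                merged[lang] = ckpt["adapters"][lang]
    chosen_steps = {lang: step for lang, (_, step) in best.items()}
    return merged, chosen_steps


# Toy example (made-up values): the two languages peak at different steps.
checkpoints = [
    {"step": 1000,
     "adapters": {"lang_a": "a@1000", "lang_b": "b@1000"},
     "dev_wer": {"lang_a": 12.0, "lang_b": 9.0}},
    {"step": 2000,
     "adapters": {"lang_a": "a@2000", "lang_b": "b@2000"},
     "dev_wer": {"lang_a": 10.0, "lang_b": 9.5}},
]
merged, steps = merge_adapters(checkpoints)
print(steps)
```

Because the foundation model is frozen and the adapters are the only trained parameters, each language's adapter can plausibly be taken from a different checkpoint without affecting the others, which is what lets a single merged model sit at every language's own peak.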

Paper Structure (10 sections, 1 equation, 3 figures)

Figures (3)

  • Figure 1: An overview of LDA in a Conformer model with cascaded encoders. LDAs are inserted between two consecutive Conformer layers for both 1st and 2nd passes. Each LDA module contains a stack of language-dependent parameters.
  • Figure 2: Improvements of our method over the baseline, which is also the currently launched model, on the dictation dataset. The blue bars show the WER of our model on each language. The yellow bars show the WER reduction over the baseline, so the blue and yellow bars together give the baseline WER. As shown in the figure, we achieve significant gains on most languages. On Slovak, the gain reaches 37.5%. On average, the improvement across all locales is 12.2%.
  • Figure 3: We further compare our model with full model finetuning. The yellow bars are the same as in Fig. 2. The green bars represent the gap between our model and updating all the parameters for individual locales. As shown in the figure, for most locales, our LDA's performance is on par with full model finetuning, while ours updates only a small portion of the parameters. Even on other languages, such as Czech and Hebrew, the yellow bars outweigh the green bars. The blue bars show our improvements over the baselines trained with supervised data only, demonstrating the contribution of NST.
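The LDA placement described in Figure 1 can be illustrated with a small sketch. The caption states only that LDAs sit between consecutive Conformer layers and hold a stack of language-dependent parameters. The residual bottleneck design, the ReLU nonlinearity, the zero initialization of the up-projection, and all names and sizes below are assumptions made for illustration, written in plain Python to stay self-contained.

```python
import random
from typing import List


class LanguageDependentAdapter:
    """Hypothetical residual bottleneck adapter with one parameter set per language."""

    def __init__(self, num_langs: int, dim: int, bottleneck: int, seed: int = 0):
        rng = random.Random(seed)
        # Stack of per-language down-projections, each of shape [bottleneck][dim].
        self.down = [
            [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(bottleneck)]
            for _ in range(num_langs)
        ]
        # Stack of per-language up-projections, each of shape [dim][bottleneck].
        # Zero init (an assumption) makes the adapter start as the identity map.
        self.up = [
            [[0.0] * bottleneck for _ in range(dim)] for _ in range(num_langs)
        ]

    def __call__(self, x: List[float], lang_id: int) -> List[float]:
        # Select this language's parameters from the stack.
        down, up = self.down[lang_id], self.up[lang_id]
        # Down-project and apply ReLU.
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in down]
        # Up-project and add the residual connection.
        delta = [sum(w * h for w, h in zip(row, hidden)) for row in up]
        return [xi + di for xi, di in zip(x, delta)]


# The adapter would sit between two frozen layers: layer_k -> adapter -> layer_k+1.
adapter = LanguageDependentAdapter(num_langs=2, dim=4, bottleneck=2)
frame = [1.0, -2.0, 0.5, 3.0]  # stand-in for one frame of layer output
print(adapter(frame, lang_id=0) == frame)  # prints True
```

In this sketch only `down` and `up` would be trained while the surrounding layers stay frozen, and indexing by `lang_id` keeps each language's parameters separate from the others.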