Towards scalable efficient on-device ASR with transfer learning

Laxmi Pandey, Ke Li, Jinxi Guo, Debjyoti Paul, Arthur Guo, Jay Mahadeokar, Xuedong Zhang

TL;DR

The paper tackles scalable, efficient on-device ASR for low-resource languages by systematically evaluating multilingual pretraining and transfer learning. It compares applying transfer learning at the RNNT training stage versus the MinWER fine-tuning stage, contrasts in-domain with out-of-domain pretraining, and examines the effects on rare-word recognition and zero-shot languages. The key finding is that multilingual pretraining at the RNNT stage, followed by monolingual MinWER fine-tuning, yields substantial WER reductions relative to monolingual baselines (e.g., 36.2% on MLS and 42.8% on in-house data). Out-of-domain pretraining generally provides larger gains, and rare words benefit more from out-of-domain transfer. The study also reports improved training efficiency and faster convergence, and identifies language-specific effects in zero-shot scenarios, offering practical guidance for deploying multilingual transfer learning in on-device ASR systems.

Abstract

Multilingual pretraining for transfer learning significantly boosts the robustness of low-resource monolingual ASR models. This study systematically investigates three main aspects: (a) the impact of transfer learning on model performance during initial training or fine-tuning, (b) the influence of transfer learning across dataset domains and languages, and (c) the effect on rare-word recognition compared to non-rare words. Our findings suggest that RNNT-loss pretraining, followed by monolingual fine-tuning with Minimum Word Error Rate (MinWER) loss, consistently reduces Word Error Rate (WER) across languages such as Italian and French. WER reductions (WERR) reach 36.2% and 42.8% compared to monolingual baselines for the MLS and in-house datasets, respectively. Out-of-domain pretraining leads to 28% higher WERR than in-domain pretraining. Both rare and non-rare words benefit, with rare words showing greater improvements with out-of-domain pretraining, and non-rare words with in-domain pretraining.
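
The abstract reports results as WER and WERR. As a minimal sketch of how these two quantities are usually computed, the Python below assumes that WER is the word-level edit distance divided by the number of reference words, and that WERR is the relative reduction against the baseline WER. Both are the common conventions, but this excerpt does not spell out the paper's exact definitions. The function names and the example numbers are illustrative only and are not taken from the paper.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,                        # deletion
                            curr[j - 1] + 1,                    # insertion
                            prev[j - 1] + substitution_cost))   # match or substitution
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER in percent: total word errors / total reference words."""
    errors = sum(word_edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    total_words = sum(len(r.split()) for r in references)
    return 100.0 * errors / total_words


def werr(baseline_wer, new_wer):
    """Relative WER reduction in percent (assumed definition of WERR)."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Made-up example: one reference word is dropped by the hypothesis.
    print(wer(["the cat sat on the mat"], ["the cat sat on mat"]))
    # Made-up WERs: halving the baseline WER is a 50% WERR.
    print(werr(20.0, 10.0))
```

Under this definition, a WERR of 36.2% means the transferred model's WER is 63.8% of the monolingual baseline's WER, not that the WER dropped by 36.2 absolute points.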

Paper Structure

This paper contains 17 sections, 2 equations, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: Visual representation of our multilingual pretraining strategy, showcasing both in-domain (MLS:seed | MLS:target) and out-of-domain (In-house:seed | MLS:target) approaches, alongside the ASR model training architecture with Alignment Restricted RNNT (AR-RNNT) and MinWER loss function.
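
The caption for Figure 1 names a MinWER loss but this excerpt does not reproduce the paper's equations. As a hedged sketch, the Python below implements one widely used formulation of minimum word error rate training: the expected number of word errors over an N-best list, with hypothesis probabilities renormalized over that list and the mean error count subtracted as a baseline. Whether the paper uses exactly this variant is an assumption. The function name `minwer_loss` and its inputs are hypothetical.

```python
import math


def minwer_loss(log_probs, word_errors):
    """Expected (mean-centered) word errors over an N-best list.

    log_probs:   model log-probability of each N-best hypothesis
    word_errors: number of word errors of each hypothesis against the reference
    """
    if len(log_probs) != len(word_errors) or not log_probs:
        raise ValueError("log_probs and word_errors must be non-empty and aligned")

    # Renormalize probabilities over the N-best list (numerically stable softmax).
    max_lp = max(log_probs)
    weights = [math.exp(lp - max_lp) for lp in log_probs]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Subtracting the mean error count reduces variance without changing the gradient direction.
    mean_errors = sum(word_errors) / len(word_errors)
    return sum(p * (e - mean_errors) for p, e in zip(probs, word_errors))


if __name__ == "__main__":
    # Made-up 2-best list: the first hypothesis has 0 errors, the second has 2.
    print(minwer_loss([math.log(0.5), math.log(0.5)], [0, 2]))  # equal mass on both
    print(minwer_loss([math.log(0.9), math.log(0.1)], [0, 2]))  # mass on the better one
```

The loss decreases as probability mass moves toward hypotheses with fewer word errors, which is why this objective is typically applied as a fine-tuning stage after a model has already been trained with a likelihood-style loss such as RNNT.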