Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages

Xabier de Zuazo, Eva Navas, Ibon Saratxaga, Inma Hernáez Rioja

TL;DR

This paper improves ASR for low-resource languages by augmenting Whisper with both traditional n-gram LMs and large language models (LLMs), evaluated on Basque, Galician, Catalan, and Spanish. The authors fine-tune Whisper on Common Voice v13.0, fuse external language priors at inference using a rescoring scheme, optimize the language model parameters with Bayesian methods, and evaluate robustness on in-distribution (ID) and out-of-distribution (OOD) datasets. Key contributions include the ERER robustness metric, a comprehensive leakage analysis, and an extensive ablation of evaluation parameters. The results show WER reductions of up to 76% from fine-tuning (ID Basque), up to 51% additional gains with 5-gram LMs, and robust improvements from LLMs across languages. The work provides practical guidance on parameter tuning, demonstrates the complementary strengths of traditional LMs and LLMs, and makes the code openly available to advance inclusive, multilingual ASR systems.
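The rescoring idea mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a common shallow-fusion form in which each candidate transcription is scored as the ASR log-probability plus a weighted LM log-probability plus a word-count bonus. The names `Hypothesis`, `fused_score`, `rescore`, and the weights `alpha` and `beta` are hypothetical, and the log-probabilities in the usage note are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    asr_logprob: float  # log-probability assigned by the ASR decoder (assumed given)
    lm_logprob: float   # log-probability assigned by the external LM (assumed given)


def fused_score(h: Hypothesis, alpha: float, beta: float) -> float:
    """Assumed fusion rule: ASR score + alpha * LM score + beta * word count."""
    return h.asr_logprob + alpha * h.lm_logprob + beta * len(h.text.split())


def rescore(hyps: list[Hypothesis], alpha: float = 0.5, beta: float = 0.0) -> Hypothesis:
    """Return the candidate with the highest fused score."""
    return max(hyps, key=lambda h: fused_score(h, alpha, beta))


# Made-up candidates: the ASR model slightly prefers the second one,
# while the LM strongly prefers the first.
candidates = [
    Hypothesis("kaixo mundua", asr_logprob=-4.0, lm_logprob=-6.0),
    Hypothesis("kaixo munduak", asr_logprob=-3.8, lm_logprob=-9.0),
]
best_without_lm = rescore(candidates, alpha=0.0)
best_with_lm = rescore(candidates, alpha=0.5)
```

With `alpha=0.0` the LM is ignored and the ASR-preferred candidate wins; with `alpha=0.5` the LM term changes the ranking. In a setup like this, `alpha` and `beta` are the kind of parameters that a Bayesian optimizer would tune on held-out data.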

Abstract

Automatic speech recognition systems have undoubtedly advanced with the integration of multilingual and multitask models such as Whisper, which have shown a promising ability to understand and process speech across a wide range of languages. Despite their robustness, these models often fall short in handling the linguistic distinctions of minority languages. This study addresses this gap by integrating traditional and novel language models with fine-tuned Whisper models to raise their performance in less commonly studied languages. Through rigorous fine-tuning and evaluation across multiple datasets, we demonstrate substantial improvements in word error rate, particularly in low-resource scenarios. Our approach not only takes advantage of the extensive data Whisper was pre-trained on, but also complements its linguistic adaptability by incorporating language models. We obtained improvements of up to 51% for in-distribution datasets and up to 34% for out-of-distribution sentences using statistical language models, while large language models provided moderate but consistently robust improvements across diverse linguistic contexts. The findings reveal that, while the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters. Finally, we emphasize the importance of selecting appropriate evaluation parameters when reporting results for transformer-based ASR models. In summary, this research clears the way for more inclusive ASR technologies that perform better across languages by enriching their linguistic knowledge. For further implementation details of this study, the technical documentation and source code are available at http://www.github.com/hitz-zentroa/whisper-lm.
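To make the reported percentages concrete, the following Python sketch computes word error rate (WER) from a word-level edit distance and then a relative reduction between two WER values. It assumes the improvements quoted above are relative reductions in WER. The function names are hypothetical and the numbers in the usage lines are invented for illustration, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming Levenshtein distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_error_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent; positive means the new system is better."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Invented example: one substitution and one deletion over four reference words.
wer = word_error_rate("a b c d", "a x c")
# Invented WER values, chosen only to show the arithmetic.
reduction = relative_error_reduction(baseline_wer=20.0, new_wer=10.0)
```

A negative value from `relative_error_reduction` would indicate that the new configuration performs worse than the baseline.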

Paper Structure

This paper contains 32 sections, 6 equations, 6 figures, 16 tables.

Figures (6)

  • Figure 1: Distribution of dataset hours across different training phases.
  • Figure 2: Effective robustness of RER by model size.
  • Figure 3: Effective robustness of RER by language.
  • Figure 4: The averaged RER across different model sizes to study the impact of various evaluation parameters on the WER. Negative values indicate performance decreases when changing from our selected baseline.
  • Figure 5: LM optimization trials with better scores being more opaque.
  • ...and 1 more figures