Mai Ho'omāuna i ka 'Ai: Language Models Improve Automatic Speech Recognition in Hawaiian

Kaavya Chaparala; Guido Zarrella; Bruce Torres Fischer; Larry Kimura; Oiwi Parker Jones

TL;DR

This study tackles the challenge of improving automatic speech recognition (ASR) for Hawaiian, a low-resource language, by evaluating zero-shot transfer of the Whisper foundation model and by augmenting it with an external Hawaiian language model (LM) trained on ~1.5M words. The authors replicate a state-of-the-art Hawaiian LM, use it to rescore Whisper's outputs, and assess performance on a manually curated Hawaiian test set derived from Ka Leo Hawai'i. They show a small but statistically significant WER improvement (about 1–2%) when rescoring Whisper with the Hawaiian LM, with the strongest gains observed for the large-v2 model. The work demonstrates the value of leveraging available text data to enhance ASR for underrepresented languages and points to directions such as larger LMs, fine-tuning, and self-supervised techniques to further close the gap to high-resource languages.

Abstract

In this paper we address the challenge of improving Automatic Speech Recognition (ASR) for a low-resource language, Hawaiian, by incorporating large amounts of independent text data into an ASR foundation model, Whisper. To do this, we train an external language model (LM) on ~1.5M words of Hawaiian text. We then use the LM to rescore Whisper and compute word error rates (WERs) on a manually curated test set of labeled Hawaiian data. As a baseline, we use Whisper without an external LM. Experimental results reveal a small but significant improvement in WER when ASR outputs are rescored with a Hawaiian LM. The results support leveraging all available data in the development of ASR systems for underrepresented languages.
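For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate can be computed: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the paper's own scoring and text normalization may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting, with no
    case folding or punctuation handling.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

With placeholder tokens, `word_error_rate("a b c d", "a x c d")` counts one substitution over four reference words. Because insertions are counted, the value can exceed 1 when the hypothesis is much longer than the reference.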

Paper Structure (11 sections, 1 equation, 4 figures, 1 table)

Introduction
Methods
Hawaiian LMs
Rescoring Whisper
ASR Test Set
Experiment Results
Which Whisper model transfers best to Hawaiian?
Does rescoring improve the best Whisper model?
Does it matter how much text the LM is trained on?
Discussion
Acknowledgments

Figures (4)

Figure 1: Large ASR models produce the lowest word error rates (WERs) for Hawaiian test data. Left panel: We compared six Whisper models on Hawaiian using zero-shot transfer without a Hawaiian language model (LM), as a baseline for comparing ASR models with LMs. Asterisks indicate the best models, large and large-v2. Right panel: No statistical difference in WER was observed between large and large-v2 ($t_{3.309} = 0.002, p = 0.999$, Welch's t-test). Error bars show standard error of the mean.
Figure 2: An overview of zero-shot transfer in Hawaiian using Whisper models without LMs (gray bars) and with LMs (blue bars). The $\alpha$ values on the x-axes show the weighting of Hawaiian LMs against the ASR predictions (see text for details). Each panel presents results for a different Whisper model. The difference for the best models (e.g. large-v2) is small and hard to appreciate with the y-limits fixed at this scale. For large-v2 results (bottom-right panel) with rescaled y-limits, see Figure 3.
Figure 3: Rescoring with a Hawaiian LM provides a small but significant improvement on the zero-shot Whisper baseline. Rescoring results for large-v2. The $\alpha$ values weight the contribution of the LM. $\alpha=0$ means no contribution of the LM (baseline model). Other values add increasing weight to the LM. The best WER was found at $\alpha=0.25$ where we observe a small but significant improvement on the baseline model ($t_2 = 19.498, p = 0.003$, one-sample t-test).
Figure 4: Post hoc exploration of the amount of training text, LM validation perplexity, and Whisper WER. Hawaiian LMs were trained on decreasing fractions of data: 1/2 (purple), 1/4 (orange), 1/8 (green), and 1/16 (red). See text for details.
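The role of $\alpha$ in Figures 2 and 3 can be illustrated with a small sketch of LM rescoring over a list of candidate transcripts. This assumes a common log-linear form, ASR log-probability plus $\alpha$ times LM log-probability; the paper's own equation is not reproduced on this page and may include additional terms. The names `Hypothesis` and `rescore` and all numeric scores are hypothetical.

```python
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    text: str            # candidate transcript
    asr_logprob: float   # log-probability assigned by the ASR model
    lm_logprob: float    # log-probability assigned by the external LM


def rescore(hypotheses: List[Hypothesis], alpha: float) -> str:
    """Return the text of the candidate with the highest combined score.

    Assumed scoring rule: asr_logprob + alpha * lm_logprob.
    With alpha = 0 the LM is ignored, which matches the baseline in Figure 3.
    """
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    best = max(hypotheses, key=lambda h: h.asr_logprob + alpha * h.lm_logprob)
    return best.text


# Invented scores: the ASR model slightly prefers A, the LM strongly prefers B.
candidates = [
    Hypothesis("hypothesis A", asr_logprob=-1.0, lm_logprob=-12.0),
    Hypothesis("hypothesis B", asr_logprob=-1.5, lm_logprob=-6.0),
]
baseline_choice = rescore(candidates, alpha=0.0)
rescored_choice = rescore(candidates, alpha=0.25)
```

In this toy case the baseline picks "hypothesis A", while a weight of 0.25 lets the LM's preference tip the choice to "hypothesis B". Larger weights push the decision further toward the LM, which is one plausible reason a tuned intermediate value can outperform both extremes.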
