KinSPEAK: Improving speech recognition for Kinyarwanda via semi-supervised learning methods

Antoine Nzeyimana

TL;DR

This work tackles robust ASR for a low-resource language, Kinyarwanda, by integrating self-supervised pre-training on Kinyarwanda-only data, a multi-stage curriculum fine-tuning regime, and semi-supervised learning to exploit large unlabelled datasets, all using public-domain sources. It introduces a studio-quality JW.ORG-based corpus for clean supervision, and compares syllable- versus character-based tokenization, finding syllables superior. Across five semi-supervised generations, the approach achieves 3.2% WER on JW.ORG and 15.6% WER on Mozilla Common Voice, with 1.0% CER on JW.ORG, indicating strong performance given the language's morphology and open vocabulary. The methods and dataset collection strategy offer a transferable blueprint for improving ASR in other low-resource languages, with potential extensions to mobile deployment and language technology applications like translation and information retrieval.
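The WER and CER figures quoted above are edit-distance metrics: the number of word (or character) substitutions, deletions and insertions needed to turn the model output into the reference, divided by the reference length. As a minimal sketch of how such scores are commonly computed, the following uses a plain Levenshtein distance. The function names are illustrative, and the text normalization and the treatment of spaces in CER are assumptions, since the summary does not specify how the paper scores its outputs.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate. Assumption: spaces count as characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Under this definition, a 3.2% WER corresponds to roughly three word-level errors per hundred reference words.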

Abstract

Despite the recent availability of large transcribed Kinyarwanda speech datasets, achieving robust speech recognition for Kinyarwanda is still challenging. In this work, we show that using self-supervised pre-training, following a simple curriculum schedule during fine-tuning, and using semi-supervised learning to leverage large unlabelled speech data significantly improve speech recognition performance for Kinyarwanda. Our approach focuses on using public domain data only. A new studio-quality speech dataset is collected from a public website, then used to train a clean baseline model. The clean baseline model is then used to rank examples from a more diverse and noisy public dataset, defining a simple curriculum training schedule. Finally, we apply semi-supervised learning to label and learn from large unlabelled data in five successive generations. Our final model achieves 3.2% word error rate (WER) on the new dataset and 15.6% WER on the Mozilla Common Voice benchmark, which is state-of-the-art to the best of our knowledge. Our experiments also indicate that using syllabic rather than character-based tokenization results in better speech recognition performance for Kinyarwanda.
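The ranking step described in the abstract can be illustrated with a small sketch. It assumes that each noisy example already has a difficulty score produced with the clean baseline model (for instance, the error rate of the baseline's transcript against the reference) and that training stages are cumulative, moving from the easiest examples to the full set. Neither the scoring criterion nor the stage layout is given in the abstract, so both are assumptions, and `build_curriculum` is a hypothetical helper rather than the paper's implementation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def build_curriculum(examples: Sequence[T],
                     difficulty: Sequence[float],
                     num_stages: int) -> List[List[T]]:
    """Split examples into cumulative stages ordered from easy to hard.

    difficulty[i] is an assumed per-example score from the clean baseline
    model; lower means the baseline handles the example better.
    """
    if len(examples) != len(difficulty):
        raise ValueError("examples and difficulty must have the same length")
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")

    # Rank examples from easiest (lowest score) to hardest.
    order = sorted(range(len(examples)), key=lambda i: difficulty[i])
    ranked = [examples[i] for i in order]

    stages = []
    for stage in range(1, num_stages + 1):
        # Ceiling division so the final stage always covers every example.
        cutoff = (len(ranked) * stage + num_stages - 1) // num_stages
        stages.append(ranked[:cutoff])
    return stages


# Usage: four clips with hypothetical baseline scores, two stages.
stages = build_curriculum(["clip_a", "clip_b", "clip_c", "clip_d"],
                          [0.9, 0.1, 0.5, 0.3],
                          num_stages=2)
```

Here the first stage holds the two clips the baseline transcribes best, and the second stage holds all four, so later fine-tuning stages see progressively noisier data.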

Paper Structure (17 sections, 1 equation, 2 figures, 5 tables)

Introduction
Related work
Methods
Collecting studio-quality transcribed utterances via speech-text alignment
ASR Model architecture
Multi-staged curriculum schedule for training
Syllable-based tokenization
Experimental setup
JW.ORG speech data gathering and text-speech alignment
ASR model implementation
Training process
Evaluation
Results and discussion
Effects of tokenization, self-supervised pre-training and curriculum learning
Semi-supervised learning results
...and 2 more sections

Figures (2)

Figure 1: Speech-text alignment mobile interface. The annotator is asked to touch the last word played by the audio clip before the pause. The selected segment is highlighted and then cut out and the process repeats until the end of the document.
Figure 2: ASR Model architecture
