KinSPEAK: Improving speech recognition for Kinyarwanda via semi-supervised learning methods
Antoine Nzeyimana
TL;DR
This work tackles robust ASR for a low-resource language, Kinyarwanda, by integrating self-supervised pre-training on Kinyarwanda-only data, a multi-stage curriculum fine-tuning regime, and semi-supervised learning to exploit large unlabelled datasets, all using public-domain sources. It introduces a studio-quality JW.ORG-based corpus for clean supervision, and compares syllable- versus character-based tokenization, finding syllables superior. Across five semi-supervised generations, the approach achieves 3.2% WER on JW.ORG and 15.6% WER on Mozilla Common Voice, with 1.0% CER on JW.ORG, indicating strong performance given the language's morphology and open vocabulary. The methods and dataset collection strategy offer a transferable blueprint for improving ASR in other low-resource languages, with potential extensions to mobile deployment and language technology applications like translation and information retrieval.
Abstract
Despite the recent availability of large transcribed Kinyarwanda speech data, achieving robust speech recognition for Kinyarwanda is still challenging. In this work, we show that using self-supervised pre-training, following a simple curriculum schedule during fine-tuning, and using semi-supervised learning to leverage large unlabelled speech data significantly improve speech recognition performance for Kinyarwanda. Our approach focuses on using public domain data only. A new studio-quality speech dataset is collected from a public website, then used to train a clean baseline model. The clean baseline model is then used to rank examples from a more diverse and noisy public dataset, defining a simple curriculum training schedule. Finally, we apply semi-supervised learning to label and learn from large unlabelled data in five successive generations. Our final model achieves 3.2% word error rate (WER) on the new dataset and 15.6% WER on the Mozilla Common Voice benchmark, which is state-of-the-art to the best of our knowledge. Our experiments also indicate that using syllabic rather than character-based tokenization results in better speech recognition performance for Kinyarwanda.
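
To make the WER metric and the ranking step concrete, the following is a minimal sketch in Python. WER is computed in the standard way, as word-level edit distance divided by the number of reference words. The ranking criterion is an assumption made here for illustration: the abstract does not state how the clean baseline model scores the noisy examples, so this sketch orders them by the WER of the baseline transcript against the reference. The function names `word_error_rate` and `rank_for_curriculum` are hypothetical.

```python
from typing import List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def rank_for_curriculum(examples: List[Tuple[str, str, str]]) -> List[str]:
    """Order utterance ids from easiest to hardest.

    Each example is (utterance_id, reference_text, baseline_hypothesis).
    Assumption: "easy" means low WER under the clean baseline model.
    """
    scored = [(word_error_rate(ref, hyp), uid) for uid, ref, hyp in examples]
    scored.sort()  # ties are broken by utterance id
    return [uid for _, uid in scored]
```

Under this assumption, a curriculum schedule could present the lowest-WER utterances first and admit progressively noisier ones in later fine-tuning stages; where the stage boundaries fall is a separate design choice not covered by this sketch.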
