Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language
Turi Abu, Ying Shi, Thomas Fang Zheng, Dong Wang
TL;DR
Sagalee addresses the scarcity of Oromo ASR data with an open, crowd-sourced dataset of 100 hours of read speech from 283 speakers. The authors establish baselines with a Conformer trained from scratch, using either a hybrid CTC-AED loss (15.32% WER) or a pure CTC loss (18.74% WER), and show that fine-tuning a large multilingual model (Whisper) lowers the WER further to 10.82%. The dataset and baselines demonstrate that modern ASR methods are viable for Oromo and provide a resource to spur further research on speech processing for low-resource languages. The work highlights both the challenges and the potential for improving Oromo ASR, and outlines plans to expand the dataset and apply it to other speech tasks.
Abstract
We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. The dataset was collected through a crowd-sourcing initiative and encompasses a diverse range of speakers and phonetic variations. It consists of 100 hours of real-world audio recordings paired with transcriptions, covering read speech in both clean and noisy environments. This dataset addresses the critical need for ASR resources for the Oromo language, which is underrepresented. To show its applicability to the ASR task, we conducted experiments using the Conformer model, achieving a Word Error Rate (WER) of 15.32% with a hybrid CTC and AED loss and a WER of 18.74% with a pure CTC loss. Additionally, fine-tuning the Whisper model lowered the WER to 10.82%. These results establish baselines for Oromo ASR, highlighting both the challenges and the potential for improving ASR performance in Oromo. The dataset is publicly available at https://github.com/turinaf/sagalee, and we encourage its use for further research and development in Oromo speech processing.
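Because all reported results are Word Error Rates, a minimal sketch of how WER is conventionally computed may help when comparing against these baselines. It assumes whitespace tokenization and no text normalization. The function name `word_error_rate` and the placeholder tokens are illustrative and are not taken from the paper or its repository, whose exact scoring setup may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: words are separated by whitespace and compared exactly,
    with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens: one substitution and one deletion over four reference words.
print(word_error_rate("w1 w2 w3 w4", "w1 wX w3"))
```

Corpus-level WER is usually obtained by summing the edit counts and reference lengths over all utterances before dividing, rather than by averaging per-utterance rates.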
