Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects
Orevaoghene Ahia, Anuoluwapo Aremu, Diana Abagyan, Hila Gonen, David Ifeoluwa Adelani, Daud Abolade, Noah A. Smith, Yulia Tsvetkov
TL;DR
This work addresses the neglect of Yorùbá dialects in NLP by creating YorùLect, a high-quality parallel text and speech corpus spanning four dialects and three domains (religious, news, and TED Talks). Through zero-shot and dialect-adaptive finetuning experiments on machine translation (MT), automatic speech recognition (ASR), and speech-to-text translation (S2TT), the authors reveal substantial performance gaps between Standard Yorùbá and the other dialects. They also show meaningful gains when models are finetuned on dialect-specific data, including a $14$ BLEU gain for MT and a $20$ WER decrease for ASR per dialect after adaptation. The study combines text localization, community-sourced speech data, and both automatic and human evaluation, and it highlights the challenges posed by dialectal variation and diacritic handling. By publicly releasing YorùLect and the accompanying models, the work aims to advance equitable NLP tooling for Yorùbá and other African languages, and it argues that dialect-aware approaches are essential for practical deployment.
Abstract
Yorùbá, an African language with roughly 47 million speakers, encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties for which there are few or no resources or tools. We take steps towards bridging this gap by introducing YORÙLECT, a new high-quality parallel text and speech corpus covering three domains and four regional Yorùbá dialects. To develop this corpus, we engaged native speakers, travelling to communities where these dialects are spoken, to collect text and speech data. Using our newly created corpus, we conducted extensive experiments on (text) machine translation, automatic speech recognition, and speech-to-text translation. Our results reveal substantial performance disparities between standard Yorùbá and the other dialects across all tasks. However, we also show that dialect-adaptive finetuning narrows this gap. We believe our dataset and experimental analysis will contribute greatly to developing NLP tools for Yorùbá and its dialects, and potentially for other African languages, by improving our understanding of existing challenges and offering a high-quality dataset for further development. We release the YORÙLECT dataset and models publicly under an open license.
