Table of Contents
Fetching ...

Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects

Orevaoghene Ahia, Anuoluwapo Aremu, Diana Abagyan, Hila Gonen, David Ifeoluwa Adelani, Daud Abolade, Noah A. Smith, Yulia Tsvetkov

TL;DR

This work addresses the neglect of Yorùbá dialects in NLP by creating YorùLect, a high-quality parallel text and speech corpus spanning four dialects and three domains (religious, news, Ted Talks). Through zero-shot and dialect-adaptive finetuning experiments on MT, ASR, and S2TT, the authors reveal substantial performance gaps between Standard Yorùbá and other dialects, and demonstrate meaningful gains when models are finetuned on dialect-specific data, including a $14$ BLEU MT gain and a $20$ WER decrease for ASR per dialect after adaptation. The study combines text localization, community-sourced speech data, and comprehensive evaluation (automatic and human), highlighting the challenges posed by dialectal variation and diacritic handling. By publicly releasing YorùLect and models, the work aims to advance equitable NLP tooling for Yorùbá and other African languages, emphasizing dialect-aware approaches as essential for practical deployment.

Abstract

Yorùbá an African language with roughly 47 million speakers encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties for which there are little to no resources or tools. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus YORÙLECT across three domains and four regional Yorùbá dialects. To develop this corpus, we engaged native speakers, travelling to communities where these dialects are spoken, to collect text and speech data. Using our newly created corpus, we conducted extensive experiments on (text) machine translation, automatic speech recognition, and speech-to-text translation. Our results reveal substantial performance disparities between standard Yorùbá and the other dialects across all tasks. However, we also show that with dialect-adaptive finetuning, we are able to narrow this gap. We believe our dataset and experimental analysis will contribute greatly to developing NLP tools for Yorùbá and its dialects, and potentially for other African languages, by improving our understanding of existing challenges and offering a high-quality dataset for further development. We release YORÙLECT dataset and models publicly under an open license.

Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects

TL;DR

This work addresses the neglect of Yorùbá dialects in NLP by creating YorùLect, a high-quality parallel text and speech corpus spanning four dialects and three domains (religious, news, Ted Talks). Through zero-shot and dialect-adaptive finetuning experiments on MT, ASR, and S2TT, the authors reveal substantial performance gaps between Standard Yorùbá and other dialects, and demonstrate meaningful gains when models are finetuned on dialect-specific data, including a BLEU MT gain and a WER decrease for ASR per dialect after adaptation. The study combines text localization, community-sourced speech data, and comprehensive evaluation (automatic and human), highlighting the challenges posed by dialectal variation and diacritic handling. By publicly releasing YorùLect and models, the work aims to advance equitable NLP tooling for Yorùbá and other African languages, emphasizing dialect-aware approaches as essential for practical deployment.

Abstract

Yorùbá an African language with roughly 47 million speakers encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties for which there are little to no resources or tools. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus YORÙLECT across three domains and four regional Yorùbá dialects. To develop this corpus, we engaged native speakers, travelling to communities where these dialects are spoken, to collect text and speech data. Using our newly created corpus, we conducted extensive experiments on (text) machine translation, automatic speech recognition, and speech-to-text translation. Our results reveal substantial performance disparities between standard Yorùbá and the other dialects across all tasks. However, we also show that with dialect-adaptive finetuning, we are able to narrow this gap. We believe our dataset and experimental analysis will contribute greatly to developing NLP tools for Yorùbá and its dialects, and potentially for other African languages, by improving our understanding of existing challenges and offering a high-quality dataset for further development. We release YORÙLECT dataset and models publicly under an open license.
Paper Structure (33 sections, 4 figures, 12 tables)

This paper contains 33 sections, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Geographical distribution of Yorùbá dialects in West Africa. Map from ozburn2023language.
  • Figure 2: MT results $(\uparrow)$. We compare BLEU across Google Translate, NLLB prior to finetuning, and NLLB after finetuning.
  • Figure 3: ASR results. $(\downarrow)$ We compare WER between zero-shot and jointly fine-tuning on all dialects on XLSR and MMS models.
  • Figure 4: S2TT results $(\uparrow)$. We compare BLEU prior to finetuning and after finetuning SeamlessM4T.