
ÌròyìnSpeech: A multi-purpose Yorùbá Speech Corpus

Tolulope Ogunremi, Kola Tubosun, Anuoluwapo Aremu, Iroro Orife, David Ifeoluwa Adelani

TL;DR

ÌròyìnSpeech addresses the scarcity of high-quality Yorùbá speech data for TTS and ASR by assembling a multi-domain corpus (~23k CC-BY-4.0 sentences, ~42 hours recorded in-house by 80 volunteers), plus ~6 hours of validated recordings collected via Mozilla Common Voice. The authors provide baselines for TTS and ASR using VITS, Conformer, and wav2vec 2.0, and investigate how diacritics, pre-training transfer, and data size affect model quality. Key findings: diacritics improve perceived naturalness in TTS, continued pre-training yields mixed gains, and adding a trigram language model lowers the ASR baseline WER to 23.8. The open release of the dataset and accompanying tools is intended to accelerate Yorùbá speech research and broader language-technology development for the community.

Abstract

We introduce ÌròyìnSpeech, a new corpus motivated by the desire to increase the amount of high-quality, contemporary Yorùbá speech data that can be used for both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) tasks. We curated about 23,000 text sentences from the news and creative writing domains under the open license CC-BY-4.0. To encourage a participatory approach to data creation, we provided 5,000 curated sentences to the Mozilla Common Voice platform to crowd-source the recording and validation of Yorùbá speech data. In total, we created about 42 hours of speech data recorded in-house by 80 volunteers, and 6 hours of validated recordings on the Mozilla Common Voice platform. Our TTS evaluation suggests that a high-fidelity, general-domain, single-speaker Yorùbá voice is possible with as little as 5 hours of speech. Similarly, for ASR we obtained a baseline word error rate (WER) of 23.8.
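
For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. It is not the authors' evaluation code. The function name `word_error_rate` and the example sentences are hypothetical, and scaling to a percentage assumes the reported 23.8 is a percentage. The example also shows that, under exact string matching, a word with a missing diacritic counts as an error.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level Levenshtein distance.

    Illustrative sketch only; tokenization here is plain whitespace
    splitting, which is an assumption rather than the paper's procedure.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: the first hypothesis word lacks its diacritics.
    print(word_error_rate("ọmọ náà lọ", "omo náà lọ"))
```

Published WER figures depend on text normalization choices (casing, punctuation, Unicode normalization of diacritics), so scores from this sketch are not directly comparable with the reported baseline.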

Paper Structure (23 sections, 5 tables)