ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis
Hawau Olamide Toyin, Rufael Marew, Humaid Alblooshi, Samar M. Magdy, Hanan Aldarmaki
TL;DR
ArVoice tackles the scarcity of high-quality, diacritized Modern Standard Arabic speech data by assembling a multi-speaker corpus that combines carefully recorded human voices with synthetic speech. It brings together a diacritized Part 1, a non-diacritized Part 2, corrected Arabic Speech Corpus (ASC) data in Part 3, and synthetic speech in Part 4, forming an 83.52-hour resource across 11 voices, 7 of them human speakers. The study establishes multi-speaker TTS baselines on ArVoice and reports that diacritics improve intelligibility, that VITS performs best among the tested models, and that augmenting with synthetic data lowers word error rate (WER) and improves subjective quality. It also explores parallel and non-parallel voice conversion (VC) using synthetic data, achieving competitive speaker-similarity scores. The ASC and synthetic portions are released under CC BY 4.0, with controlled access to the professional recordings, making ArVoice a useful benchmark for Arabic TTS, VC, and related tasks.
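Since WER is the intelligibility metric cited above, and lower values are better, a minimal sketch of how it is commonly computed may help. This is the standard word-level edit-distance definition, not the authors' evaluation code. The function names `word_error_rate` and `strip_diacritics` are hypothetical, and whether diacritics are removed before scoring in the paper is an assumption not stated here.

```python
import re

# Arabic short-vowel and related marks (fathatan .. sukun), U+064B to U+0652.
# Assumption: this range is a common, simple choice for diacritic removal.
_DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritic marks, leaving the base letters."""
    return _DIACRITICS.sub("", text)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a 4-word reference.
    print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Applying `strip_diacritics` to both the reference and the ASR transcript before calling `word_error_rate` scores only the base letters. Skipping that step also counts diacritic mismatches as word errors.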
Abstract
We introduce ArVoice, a multi-speaker Modern Standard Arabic (MSA) speech corpus with diacritized transcriptions. It is intended for multi-speaker speech synthesis and can also be useful for other tasks such as speech-based diacritic restoration, voice conversion, and deepfake detection. ArVoice comprises: (1) a new professionally recorded set from six voice talents with diverse demographics; (2) a modified subset of the Arabic Speech Corpus; and (3) high-quality synthetic speech from two commercial systems. The complete corpus totals 83.52 hours of speech across 11 voices; around 10 hours are human speech from 7 speakers. We train three open-source TTS and two voice conversion systems to illustrate the use cases of the dataset. The corpus is available for research use.
