ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis

Hawau Olamide Toyin, Rufael Marew, Humaid Alblooshi, Samar M. Magdy, Hanan Aldarmaki

TL;DR

ArVoice tackles the scarcity of high-quality, diacritized Modern Standard Arabic (MSA) speech data by assembling a multi-speaker corpus that blends carefully recorded human voices with synthetic data. It combines diacritized Part 1, non-diacritized Part 2, corrected Part 3 data from the Arabic Speech Corpus (ASC), and Part 4 synthetic speech into an 83.52-hour resource across 11 voices, including 7 human speakers. The study establishes baselines for multi-speaker text-to-speech (TTS) using ArVoice and shows that diacritics improve intelligibility, that VITS performs best among the tested models, and that synthetic data augmentation improves both word error rate (WER) and subjective quality. It also explores parallel and non-parallel voice conversion (VC) using synthetic data, achieving competitive speaker similarity metrics. The ASC and synthetic portions are released under CC BY 4.0, while the professional recordings are available under controlled access, making ArVoice a benchmark for Arabic TTS, VC, and related tasks.

Abstract

We introduce ArVoice, a multi-speaker Modern Standard Arabic (MSA) speech corpus with diacritized transcriptions. It is intended for multi-speaker speech synthesis and can also be useful for other tasks such as speech-based diacritic restoration, voice conversion, and deepfake detection. ArVoice comprises: (1) a new professionally recorded set from six voice talents with diverse demographics; (2) a modified subset of the Arabic Speech Corpus; and (3) high-quality synthetic speech from two commercial systems. The complete corpus consists of a total of 83.52 hours of speech across 11 voices; around 10 hours consist of human voices from 7 speakers. We train three open-source TTS and two voice conversion systems to illustrate the use cases of the dataset. The corpus is available for research use.

Paper Structure

This paper contains 13 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Mean average preference with/without diacritics for each TTS model, with 95% confidence intervals.
  • Figure 2: Mean average preference with 95% confidence intervals. We compare (a) the best version of VITS against the best version of ArTST, and (b) multi-speaker vs. single-speaker TTS with the VITS model.