
Towards Zero-Shot Text-To-Speech for Arabic Dialects

Khai Duy Doan, Abdul Waheed, Muhammad Abdul-Mageed

TL;DR

The paper tackles zero-shot multi-speaker TTS for Arabic, addressing resource scarcity by repurposing the QASR dataset and integrating dialect labels. It fine-tunes the open-source XTTS model on a refined, dialect-annotated Arabic corpus and evaluates on 31 unseen speakers plus an in-house dialect dataset, using SECS, WER, and human judgments. Key findings show that baseline XTTS yields lower WER, while dialect-token fine-tuning improves speaker-consistency and dialect fidelity on in-house data, with human evaluators confirming naturalness comparable to supervised baselines. The work demonstrates the feasibility of Arabic ZS-TTS with dialect-aware conditioning and provides a pathway for broader dialect coverage.
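The two automatic metrics named above can be illustrated with a minimal sketch. It assumes the usual definitions: WER as the word-level edit distance divided by the number of reference words, and SECS (speaker encoder cosine similarity) as the cosine similarity between speaker embeddings of the reference and synthesized audio. The function names and the toy embedding vectors are hypothetical. The paper's actual ASR system, speaker encoder, and text normalization are not specified in this summary and are not reproduced here.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy transcript pair: one substitution and one deletion over four words.
    print(word_error_rate("a b c d", "a x c"))
    # Toy speaker embeddings (hypothetical values, not real encoder output).
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

In practice the hypothesis transcript would come from an ASR model run on the synthesized audio, and the embeddings from a pretrained speaker encoder. Lower WER indicates more intelligible speech, and higher SECS indicates closer speaker similarity.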

Abstract

Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; however, progress for other languages still lags behind due to insufficient resources. We address this gap for Arabic, a language of more than 450 million native speakers, by first adapting a sizeable existing dataset to suit the needs of speech synthesis. Additionally, we employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting. Subsequently, we fine-tune XTTS, an open-source architecture (see https://docs.coqui.ai/en/latest/models/xtts.html, https://medium.com/machine-learns/xtts-v2-new-version-of-the-open-source-text-to-speech-model-af73914db81f, and https://medium.com/@erogol/xtts-v1-techincal-notes-eb83ff05bdc). We then evaluate our models on a dataset comprising 31 unseen speakers and an in-house dialectal dataset. Our automated and human evaluation results show convincing performance, and the models are capable of generating dialectal speech. Our study highlights significant potential for improvements in this emerging area of research in Arabic.

Paper Structure (11 sections, 3 figures, 2 tables)


Figures (3)

  • Figure 1: A zero-shot multi-speaker TTS system. It synthesizes speech from the input text while preserving the acoustic features of the input speech segment.
  • Figure 2: The three stages of training the XTTS model. Stage 1: a VQ-VAE is trained to learn a discrete audio token representation from the audio waveform. Stage 2: a GPT-2-based model is trained to autoregressively generate discrete audio tokens from the concatenation of speaker, text, and discrete audio token embeddings. Stage 3: HiFi-GAN is trained as a decoder to generate the audio waveform from the discrete audio tokens generated by the GPT-2 model.
  • Figure 3: QASR data distribution across different dialects.
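
The concatenation step described for Stage 2 in Figure 2 can be sketched with toy data. This is a minimal illustration, not the XTTS implementation: the function name, the ordering (speaker, then text, then audio), the embedding dimension, and the values are all assumptions made for the example. The caption states only that the three embedding sequences are concatenated.

```python
from typing import List

Vector = List[float]


def build_stage2_input(
    speaker_emb: List[Vector],
    text_emb: List[Vector],
    audio_emb: List[Vector],
) -> List[Vector]:
    """Concatenate three embedding sequences along the sequence axis.

    Hypothetical helper: every vector must share one embedding dimension
    so that a single autoregressive model can consume the whole sequence.
    """
    sequence = speaker_emb + text_emb + audio_emb
    if not sequence:
        raise ValueError("at least one embedding vector is required")
    dim = len(sequence[0])
    for vec in sequence:
        if len(vec) != dim:
            raise ValueError("all embeddings must share the same dimension")
    return sequence


if __name__ == "__main__":
    dim = 4  # toy embedding size, chosen only for the example
    speaker = [[0.1] * dim]                   # one speaker-conditioning vector
    text = [[0.2] * dim for _ in range(3)]    # three text-token embeddings
    audio = [[0.3] * dim for _ in range(5)]   # five audio-token embeddings
    model_input = build_stage2_input(speaker, text, audio)
    print(len(model_input), len(model_input[0]))
```

During training, the model would learn to predict each audio token from everything that precedes it in this sequence. At inference time only the speaker and text portions are supplied, and the audio tokens are generated one at a time.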