LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models

Ahmed Khaled Khamis; Hesham Ali

TL;DR

This work constructs the first publicly available Egyptian Arabic TTS dataset, builds a reproducible synthetic data generation pipeline for dialectal TTS, and releases an open-source fine-tuned model.

Abstract

Despite advances in neural text-to-speech (TTS), many Arabic dialectal varieties remain marginally addressed. Most resources are concentrated on Modern Standard Arabic (MSA) and Gulf dialects, leaving Egyptian Arabic, the most widely understood Arabic dialect, severely under-resourced. We address this gap by introducing NileTTS: 38 hours of transcribed speech from two speakers across diverse domains, including medical, sales, and general conversations. We construct this dataset using a novel synthetic pipeline: large language models (LLMs) generate Egyptian Arabic content, which is then converted to natural speech using audio synthesis tools, followed by automatic transcription and speaker diarization with manual quality verification. We fine-tune XTTS v2, a state-of-the-art multilingual TTS model, on our dataset and evaluate it against the baseline model trained on other Arabic dialects. Our contributions include: (1) the first publicly available Egyptian Arabic TTS dataset, (2) a reproducible synthetic data generation pipeline for dialectal TTS, and (3) an open-source fine-tuned model. All resources are released to advance Egyptian Arabic speech synthesis research.
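As a rough illustration of the speaker diarization step in a two-speaker setting, the sketch below assigns each audio segment to the closest of two reference speakers by cosine similarity between embedding vectors. This is only one plausible way to do it, not the authors' confirmed procedure: the function names, the `min_similarity` threshold, and the toy three-dimensional vectors are all assumptions made for illustration. Real speaker embeddings would come from a speaker-encoder model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_speaker(segment_embedding, reference_embeddings, min_similarity=0.5):
    """Return (label, score) for the most similar reference speaker.

    The label is None when the best score falls below min_similarity
    (a hypothetical threshold), so the segment can be sent to manual review.
    """
    best_label, best_score = None, -1.0
    for label, reference in reference_embeddings.items():
        score = cosine_similarity(segment_embedding, reference)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_similarity:
        return None, best_score
    return best_label, best_score


# Toy stand-ins for real speaker embeddings (assumed values).
references = {
    "speaker_a": [1.0, 0.0, 0.2],
    "speaker_b": [0.0, 1.0, 0.1],
}

label, score = assign_speaker([0.9, 0.1, 0.2], references)
unclear_label, unclear_score = assign_speaker([0.0, 0.0, 1.0], references)
```

Returning `None` for low-similarity segments is a design choice in this sketch: it gives a natural hook for the manual quality verification the abstract mentions, rather than forcing every segment onto one of the two speakers.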

Paper Structure (24 sections, 2 figures, 3 tables)

Introduction
Related Work
Arabic Text-to-Speech
Synthetic Data for Speech
Multilingual TTS and XTTS
Dataset Construction
Content Generation
Audio Synthesis
Transcription and Segmentation
Speaker Diarization
Quality Control
Dataset Statistics
Model Finetuning
Base Model: XTTS v2
Finetuning Configuration
...and 9 more sections

Figures (2)

Figure 1: Overview of the NileTTS data generation pipeline. Egyptian Arabic content is generated by LLMs, converted to speech via neural audio synthesis, transcribed and segmented using Whisper, and annotated with speaker identities using ECAPA-TDNN embeddings. Manual quality control ensures accuracy before final dataset compilation.
Figure 2: Evaluation metrics throughout training: (a) Evaluation Loss, (b) Word Error Rate, (c) Character Error Rate, (d) Speaker Similarity. The red marker indicates the selected checkpoint at step 34,289 (epoch 8).
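For readers unfamiliar with two of the metrics named in Figure 2, word error rate (WER) and character error rate (CER) are conventionally defined as the Levenshtein edit distance between a reference transcript and a hypothesis transcript, divided by the reference length in words or characters. The sketch below shows that standard definition in plain Python. It is not the paper's evaluation code: the function names are illustrative, and details such as text normalization and whether spaces count as characters are assumptions (here, text is compared as-is and spaces are counted).

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    reference_words = reference.split()
    if not reference_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length.

    Assumption: spaces are counted as characters and no normalization is applied.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


wer = word_error_rate("a b c d", "a x c")      # one substitution, one deletion
cer = character_error_rate("abcd", "abxd")     # one substitution
```

In a TTS evaluation, the hypothesis would typically be an automatic transcription of the synthesized audio and the reference would be the input text, so lower values indicate more intelligible speech.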
