ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

TL;DR

ParsVoice addresses the data scarcity in Persian TTS by introducing the largest open, multi-speaker Persian speech corpus to date. It is built with an automated, scalable pipeline that converts 2,000 audiobooks into $3{,}526$ hours of clean speech and then filters them into a $1{,}804$-hour high-quality subset from more than 470 speakers. The pipeline combines sentence-aware segmentation using ParsBERT, boundary optimization with binary and linear search, Persian-specific text and audio quality assessment, and both local and global speaker identification. Evaluation with a fine-tuned XTTS model shows competitive naturalness (MOS $=3.6$) and speaker similarity (SMOS $=4.0$) for unseen speakers, supporting the use of ParsVoice for multi-speaker TTS and rapid speaker adaptation. The dataset and the fully automated pipeline are publicly released to accelerate Persian speech technology research and real-world deployment.

Abstract

Existing Persian speech datasets are typically smaller than their English counterparts, which is a key limitation for developing Persian speech technologies. We address this gap by introducing ParsVoice, the largest Persian speech corpus designed specifically for text-to-speech (TTS) applications. We created an automated pipeline that transforms raw audiobook content into TTS-ready data, incorporating components such as a BERT-based sentence completion detector, a binary search boundary optimization method for precise audio-text alignment, and audio-text quality assessment frameworks tailored to Persian. The pipeline processes 2,000 audiobooks, yielding 3,526 hours of clean speech, which was further filtered into a 1,804-hour high-quality subset suitable for TTS, featuring more than 470 speakers. To validate the dataset, we fine-tuned XTTS for Persian, achieving a naturalness Mean Opinion Score (MOS) of 3.6/5 and a Speaker Similarity Mean Opinion Score (SMOS) of 4.0/5, demonstrating ParsVoice's effectiveness for training multi-speaker TTS systems. ParsVoice is the largest high-quality Persian speech dataset, offering speaker diversity and audio quality comparable to major English corpora. The complete dataset is publicly available, to accelerate the development of Persian speech technologies, at: https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice.
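
The abstract names a binary search boundary optimization step but does not describe it in detail. As a minimal illustration of the general idea only, and not the authors' implementation, the sketch below bisects a time interval to find the latest cut point that does not run into the next sentence. The function `find_boundary`, the predicate `overshoots`, and the timestamp `next_sentence_start` are hypothetical names introduced here; in a real pipeline the predicate would presumably be backed by an ASR or alignment check rather than a known timestamp.

```python
from typing import Callable


def find_boundary(
    lo: float,
    hi: float,
    overshoots: Callable[[float], bool],
    tol: float = 0.01,
) -> float:
    """Return the latest cut time in [lo, hi] that does not overshoot.

    Assumptions (hypothetical, for illustration):
      * overshoots(lo) is False and overshoots(hi) is True;
      * overshoots is monotone: once True, it stays True for later times.
    The result is within `tol` seconds of the true boundary.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if overshoots(mid):
            hi = mid  # cut is too late: move the upper bound down
        else:
            lo = mid  # cut is still safe: move the lower bound up
    return lo


# Hypothetical stand-in for a real check: pretend the next sentence
# starts at 7.43 s, so any cut after that point overshoots.
next_sentence_start = 7.43
cut = find_boundary(0.0, 15.0, lambda t: t > next_sentence_start)
print(f"cut at {cut:.3f} s")
```

Because each iteration halves the interval, a 15-second window is narrowed to 10 ms in about 11 predicate evaluations, which matters when each evaluation is an expensive model call.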

Paper Structure

This paper contains 17 sections, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Overview of the proposed pipeline.