ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages

Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee

TL;DR

This work tackles the scarcity of open, high-quality Persian TTS data by introducing ManaTTS, the largest public single-speaker Persian corpus (≈86 hours at 44.1 kHz) along with VirgoolInformal for ASR evaluation. It presents a fully open, MIT-licensed data collection and processing pipeline with novel sentence tokenization and a forced alignment method designed for low-resource languages, enabling reproducible dataset construction from Nasl-e-Mana. A Tacotron2-based TTS model trained on ManaTTS achieves a MOS of 3.76, approaching the MOS of 3.86 for ground-truth spectrograms and 4.01 for natural speech, validating the dataset’s quality for practical TTS development. The work emphasizes accessibility and reproducibility, offering an extendable framework for future Persian data collection and broader open-source impact in low-resource language technology.

Abstract

In this study, we introduce ManaTTS, the most extensive publicly accessible single-speaker Persian corpus, and a comprehensive framework for collecting transcribed speech datasets for the Persian language. ManaTTS, released under the open CC-0 license, comprises approximately 86 hours of audio with a sampling rate of 44.1 kHz. Alongside ManaTTS, we also generated the VirgoolInformal dataset, which extends over 5 hours of audio, to evaluate the Persian speech recognition models used for forced alignment. The datasets are supported by a fully transparent, MIT-licensed pipeline that includes tools for sentence tokenization, bounded audio segmentation, and a novel forced alignment method designed specifically for low-resource languages. With this dataset, we trained a Tacotron2-based TTS model that achieves a Mean Opinion Score (MOS) of 3.76, close to the MOS of 3.86 for utterances generated by the same vocoder from natural spectrograms and the MOS of 4.01 for the natural waveform, which supports the quality and effectiveness of the corpus.

Paper Structure (42 sections, 11 figures, 11 tables)

Figures (11)

  • Figure 1: Dataset processing pipelines.
  • Figure 2: Transcription Module
  • Figure 3: Distribution of the duration of audio chunks.
  • Figure 4: Distribution of search type and match quality of accepted chunks.
  • Figure 5: Distribution of CER values across all chunks. The vertical lines denote the threshold values for the HIGH, MIDDLE, and REJECT match qualities, as discussed in the forced-alignment section.
  • ...and 6 more figures
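
The caption of Figure 5 refers to character error rate (CER) values that are compared against thresholds to assign a match quality of HIGH, MIDDLE, or REJECT. As a rough illustration of that idea, the sketch below computes CER as the Levenshtein edit distance between a reference transcript and an ASR hypothesis, divided by the reference length, and then buckets the result. The threshold values (`HIGH_MAX`, `MIDDLE_MAX`), the function names, and the assumption that a lower CER maps to a better match quality are all hypothetical choices made for this example; the listing above does not state the actual thresholds or the exact decision rule used in the pipeline.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (character level)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical thresholds, chosen only for illustration.
HIGH_MAX = 0.05
MIDDLE_MAX = 0.20


def match_quality(cer_value: float,
                  high_max: float = HIGH_MAX,
                  middle_max: float = MIDDLE_MAX) -> str:
    """Bucket a CER value, assuming lower CER means a better match."""
    if cer_value <= high_max:
        return "HIGH"
    if cer_value <= middle_max:
        return "MIDDLE"
    return "REJECT"
```

For example, `match_quality(cer(reference_text, asr_output))` labels one audio chunk; under this sketch, chunks labeled REJECT would be dropped from the corpus.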