How Open is Open TTS? A Practical Evaluation of Open Source TTS Tools

Teodora Răgman, Adrian Bogdan Stânea, Horia Cucu, Adriana Stan

Abstract

Open-source text-to-speech (TTS) frameworks have emerged as highly adaptable platforms for developing speech synthesis systems across a wide range of languages. However, their applicability is not uniform, particularly when the target language is under-resourced or when computational resources are constrained. In this study, we systematically assess the feasibility of building novel TTS models using four widely adopted open-source architectures: FastPitch, VITS, Grad-TTS, and Matcha-TTS. Our evaluation spans multiple dimensions, including qualitative aspects such as ease of installation, dataset preparation, and hardware requirements, as well as quantitative assessments of synthesis quality for Romanian. We employ both objective metrics and subjective listening tests to evaluate the intelligibility, speaker similarity, and naturalness of the generated speech. The results reveal significant challenges in toolchain setup, data preprocessing, and computational efficiency, which can hinder adoption in low-resource contexts. By grounding the analysis in reproducible protocols and accessible evaluation criteria, this work aims to inform best practices and promote more inclusive, language-diverse TTS development. All information needed to reproduce this study (i.e., code and data) is available in our Git repository: https://gitlab.com/opentts_ragman/OpenTTS

Paper Structure (18 sections, 10 figures, 2 tables)

Figures (10)

  • Figure 1: Pronunciation accuracy and intelligibility, reported as Word Error Rate (WER, $\downarrow$), for all models finetuned with 10 utterances per speaker (top) and 1000 utterances per speaker (bottom), evaluated at different stages of the finetuning process: 120 iterations (left), 400 iterations (center), and 4000 iterations (right).
  • Figure 2: Perceived naturalness, reported as Mean Opinion Score (UTMOS, $\uparrow$), for all models finetuned with 10 utterances per speaker (top) and 1000 utterances per speaker (bottom), evaluated at different stages of the finetuning process: 120 iterations (left), 400 iterations (center), and 4000 iterations (right).
  • Figure 3: Speaker similarity, reported as Speaker Encoder Cosine Similarity (SECS, $\uparrow$), for all models finetuned with 10 utterances per speaker (top) and 1000 utterances per speaker (bottom), evaluated at different stages of the finetuning process: 120 iterations (left), 400 iterations (center), and 4000 iterations (right).
  • Figure 4: Perceptual quality, reported as Perceptual Evaluation of Speech Quality (PESQ, $\uparrow$), for all models finetuned with 10 utterances per speaker (top) and 1000 utterances per speaker (bottom), evaluated at different stages of the finetuning process: 120 iterations (left), 400 iterations (center), and 4000 iterations (right).
  • Figure 5: Perceptual intelligibility, reported as Short-Time Objective Intelligibility (STOI, $\uparrow$), for all models finetuned with 10 utterances per speaker (top) and 1000 utterances per speaker (bottom), evaluated at different stages of the finetuning process: 120 iterations (left), 400 iterations (center), and 4000 iterations (right).
  • ...and 5 more figures
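
Two of the metrics named in the captions above, WER and SECS, have simple textbook definitions. The following is a minimal sketch of those definitions in plain Python, not the evaluation pipeline used in the paper. The function names are hypothetical, and in practice the hypothesis transcript would come from an ASR system and the vectors from a speaker encoder, neither of which is specified in this summary.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words (lower is better)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j] (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (higher is better)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy inputs for illustration only; not data from the paper.
    print(word_error_rate("a b c d", "a x c"))        # one substitution + one deletion over 4 words
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, which is worth keeping in mind when reading lightly finetuned models on a WER axis.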