TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems

Christoph Minixhofer, Ondrej Klejch, Peter Bell

TL;DR

This work introduces Text to Speech Distribution Score 2 (TTSDS2), a more robust and improved version of TTSDS. It is the only one of 16 compared metrics to achieve a Spearman correlation above 0.50 for every domain and subjective score evaluated.

Abstract

Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive. Subjective metrics such as Mean Opinion Score (MOS) are not easily comparable between works. Objective metrics are frequently used, but rarely validated against subjective ones. Both kinds of metrics are challenged by recent TTS systems capable of producing synthetic speech indistinguishable from real speech. In this work, we introduce Text to Speech Distribution Score 2 (TTSDS2), a more robust and improved version of TTSDS. Across a range of domains and languages, it is the only one of 16 compared metrics to achieve a Spearman correlation above 0.50 for every domain and subjective score evaluated. We also release a range of resources for evaluating synthetic speech close to real speech: a dataset with over 11,000 subjective opinion score ratings; a pipeline for continually recreating a multilingual test dataset to avoid data leakage; and a continually updated benchmark for TTS in 14 languages.
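
To make the headline criterion concrete, the following is a minimal sketch of how one might check whether an objective metric reaches a Spearman correlation above 0.50 for every domain and subjective score. It is not the authors' evaluation code: the function names, the dictionary layout, and the example numbers are hypothetical, and the sketch assumes one metric value and one mean subjective rating per TTS system.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def passes_everywhere(metric_scores, subjective_scores, threshold=0.50):
    """Check the criterion for one metric.

    metric_scores:     {domain: [metric value per system]}            (hypothetical layout)
    subjective_scores: {(domain, score_name): [rating per system]}    (hypothetical layout)
    """
    return all(
        spearman(metric_scores[domain], ratings) > threshold
        for (domain, _score_name), ratings in subjective_scores.items()
    )


# Illustrative, made-up numbers for four systems in one domain.
example_metric = {"clean": [62.0, 70.5, 81.2, 90.1]}
example_ratings = {("clean", "MOS"): [3.1, 3.5, 3.9, 4.4]}
print(passes_everywhere(example_metric, example_ratings))
```

Because Spearman correlation depends only on rank order, a metric can satisfy this check without being linearly related to the subjective ratings.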

Paper Structure

This paper contains 31 sections, 3 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: Distribution of $F_0$ in TTSDS for ground-truth, synthetic, and noise datasets.
  • Figure 2: Correlation of three representative objective metrics with human MOS across the four datasets. Each colour/marker denotes a domain. Solid line = overall least-squares fit; dashed/dotted lines = domain-specific fits; each with corresponding Pearson $r$.
  • Figure 3: TTSDS2 scores across 14 languages. $n$ indicates the number of systems per language.
  • Figure 4: Interface for Mean Opinion Score (MOS) listening tests.
  • Figure 5: Interface for Comparison MOS (CMOS) listening tests.
  • ...and 3 more figures