TTSDS -- Text-to-Speech Distribution Score

Christoph Minixhofer, Ondřej Klejch, Peter Bell

TL;DR

The paper proposes TTSDS, a multi-factor, distribution-based benchmark that evaluates text-to-speech systems by measuring how closely synthetic speech matches real speech across five factors: intelligibility, prosody, speaker similarity, environment, and a general distribution factor. Each factor is quantified by comparing the distributions of high-dimensional self-supervised (SSL) and task-specific features from real and synthetic data using 2-Wasserstein ($W_2$) distances, and the overall score $S$ is a normalized combination of a distance-to-real-speech term and a distance-to-noise term. Evaluated on 35 TTS systems from 2008–2024, the unweighted average of factor scores correlates strongly with human judgments (Spearman $\rho$ between 0.60 and 0.83) and outperforms MOS-prediction baselines. The per-factor correlations also shift over time (environment vs. prosody), suggesting that evaluator priorities have changed. These correlations hold across changing architectures and datasets, and the framework is accompanied by an open benchmark suite and leaderboard. Overall, TTSDS offers a multi-factor, transferable metric that tracks human judgments of speech quality more closely than the MOS-prediction baselines it is compared against.
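To make the distance-based idea concrete, the following is a minimal sketch rather than the paper's implementation. It assumes one-dimensional features, approximates $W_2$ by comparing empirical quantiles on a fixed grid, and uses a hypothetical normalization, `100 * d_noise / (d_real + d_noise)`, as one plausible way to combine a real-speech distance and a noise distance into a bounded score. The function names, the quantile grid, and the toy Gaussian and uniform samples are all assumptions made for illustration.

```python
import numpy as np


def w2_1d(x, y, n_quantiles=1000):
    """Approximate 2-Wasserstein distance between two 1-D empirical samples.

    In one dimension, W2 is the L2 distance between quantile functions;
    here both quantile functions are evaluated on a fixed midpoint grid.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    q = (np.arange(n_quantiles) + 0.5) / n_quantiles
    qx = np.quantile(x, q)
    qy = np.quantile(y, q)
    return float(np.sqrt(np.mean((qx - qy) ** 2)))


def illustrative_score(d_real, d_noise):
    """Hypothetical normalization (an assumption, not taken from the paper).

    Returns 100 when the synthetic features coincide with real speech
    (d_real == 0) and approaches 0 as they move toward the noise distribution.
    """
    total = d_real + d_noise
    if total == 0:
        raise ValueError("both distances are zero; score is undefined")
    return 100.0 * d_noise / total


# Toy stand-ins for a single scalar feature (placeholder data, not speech).
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 2000)
synth = rng.normal(0.2, 1.1, 2000)
noise = rng.uniform(-5.0, 5.0, 2000)

s = illustrative_score(w2_1d(synth, real), w2_1d(synth, noise))
print(f"illustrative factor score: {s:.1f}")
```

The real features are high-dimensional, so an actual implementation would need a multivariate distance or a per-dimension reduction; the one-dimensional case is shown only because it has a simple closed form.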

Abstract

Many recently published Text-to-Speech (TTS) systems produce audio close to real speech. However, TTS evaluation needs to be revisited to make sense of the results obtained with the new architectures, approaches and datasets. We propose evaluating the quality of synthetic speech as a combination of multiple factors such as prosody, speaker identity, and intelligibility. Our approach assesses how well synthetic speech mirrors real speech by obtaining correlates of each factor and measuring their distance from both real speech datasets and noise datasets. We benchmark 35 TTS systems developed between 2008 and 2024 and show that our score computed as an unweighted average of factors strongly correlates with the human evaluations from each time period.
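As a rough illustration of the last step described in the abstract, the sketch below averages per-factor scores without weights and then computes a Spearman rank correlation against human ratings. The system names, factor scores, and ratings are made-up placeholders rather than results from the paper, and the rank computation assumes there are no ties.

```python
import numpy as np

# Factor names follow the summary; the dictionary keys are an assumed convention.
FACTORS = ("general", "environment", "intelligibility", "prosody", "speaker")


def overall_score(factor_scores):
    """Unweighted average of the per-factor scores for one system."""
    return float(np.mean([factor_scores[f] for f in FACTORS]))


def spearman_rho(a, b):
    """Spearman correlation as the Pearson correlation of ranks (assumes no ties)."""
    ranks_a = np.argsort(np.argsort(a))
    ranks_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])


# Placeholder numbers for four hypothetical systems (not data from the paper).
toy_factor_scores = {
    "sys_a": dict(zip(FACTORS, (90, 90, 90, 90, 90))),
    "sys_b": dict(zip(FACTORS, (80, 70, 75, 85, 90))),
    "sys_c": dict(zip(FACTORS, (60, 50, 70, 65, 55))),
    "sys_d": dict(zip(FACTORS, (85, 80, 90, 85, 85))),
}
toy_human_ratings = {"sys_a": 4.1, "sys_b": 3.9, "sys_c": 2.8, "sys_d": 4.3}

systems = list(toy_factor_scores)
overall = [overall_score(toy_factor_scores[name]) for name in systems]
human = [toy_human_ratings[name] for name in systems]
rho = spearman_rho(overall, human)
print(f"Spearman rho on toy data: {rho:.2f}")
```

With real evaluation data, ties in either list would call for average ranks, for example via a statistics library, instead of the double `argsort` used here.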

Paper Structure (10 sections, 6 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Distribution of the best (left) and worst (right) TTS Arena systems with respect to HuBERT representations. $S$ denotes the score.
  • Figure 2: Spearman correlation between the subjective measure, benchmark systems and our benchmark.
  • Figure 3: Development of factor score correlation coefficients over time from early speech synthesis (Blizzard'08) to the latest systems (TTS Arena).
  • Figure 4: Results of Wilcoxon signed-rank tests between systems’ extracted features. $\blacksquare$ indicates a significant difference between a pair of systems.