Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens

Ismail Rasim Ulgen, Zongyang Du, Junchen Lu, Philipp Koehn, Berrak Sisman

TL;DR

This work addresses the need for objective, interpretable evaluation of speech synthesis by separating intelligibility and prosody into targeted metrics. It introduces TTScore, a pair of conditional-likelihood metrics (TTScore-int and TTScore-pro) computed from text-conditioned seq2seq predictors over discrete speech tokens. Content tokens obtained by clustering HuBERT features measure intelligibility, while phoneme-level FACodec-based prosody tokens measure prosody; neither metric needs a ground-truth reference. Across the SOMOS, VoiceMOS, and TTSArena benchmarks, TTScore-int and TTScore-pro align more closely with human judgments of overall quality than traditional WER/CER and F0-based metrics, supporting their use for reference-free evaluation in the speech domain. The framework enables fine-grained diagnostics and could be extended to targeted evaluation of additional speech attributes.

Abstract

Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measure of intelligibility, while F0-RMSE and related pitch-based metrics offer a narrow, reference-dependent view of prosody. To address these limitations, we propose TTScore, a targeted and reference-free evaluation framework based on conditional prediction of discrete speech tokens. TTScore employs two sequence-to-sequence predictors conditioned on input text: TTScore-int, which measures intelligibility through content tokens, and TTScore-pro, which evaluates prosody through prosody tokens. For each synthesized utterance, the predictors compute the likelihood of the corresponding token sequences, yielding interpretable scores that capture alignment with intended linguistic content and prosodic structure. Experiments on the SOMOS, VoiceMOS, and TTSArena benchmarks demonstrate that TTScore-int and TTScore-pro provide reliable, aspect-specific evaluation and achieve stronger correlations with human judgments of overall quality than existing intelligibility- and prosody-focused metrics.
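To make the scoring step concrete, the sketch below shows one way a conditional likelihood over discrete tokens could be computed once a text-conditioned predictor has produced per-step logits. It is an illustration only, not the authors' implementation: the function names, the input layout (`step_logits`, `tokens`), and the per-token length normalization are assumptions that do not come from the abstract.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def conditional_log_likelihood(
    step_logits: Sequence[Sequence[float]],
    tokens: Sequence[int],
    normalize: bool = True,
) -> float:
    """Log-likelihood of an observed token sequence under a predictor.

    Assumption: step_logits[t] holds the predictor's unnormalized scores
    over the token vocabulary at step t, already conditioned on the input
    text and on tokens[:t] (teacher forcing). Producing these logits is
    the job of the seq2seq model and is outside this sketch.
    """
    if len(step_logits) != len(tokens):
        raise ValueError("need exactly one logit vector per observed token")
    if not tokens:
        raise ValueError("token sequence must be non-empty")
    total = sum(
        log_softmax(logits)[tok] for logits, tok in zip(step_logits, tokens)
    )
    # Hypothetical choice: average per token so that utterances of
    # different lengths are comparable.
    return total / len(tokens) if normalize else total


# Toy usage: a 3-symbol vocabulary and a 2-step utterance.
toy_logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
expected_tokens = conditional_log_likelihood(toy_logits, [0, 1])
unexpected_tokens = conditional_log_likelihood(toy_logits, [2, 2])
print(expected_tokens > unexpected_tokens)
```

Under this reading, tokens extracted from a synthesized utterance that the predictor finds likely given the text receive a higher score; the same routine would apply to content tokens (TTScore-int) and to prosody tokens (TTScore-pro), each with its own predictor.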

Paper Structure

This paper contains 31 sections, 2 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: (a.1) Text-to-content token generator training. (a.2) TTScore-int as the conditional likelihood, under the content token generator, of the content token sequence from the synthesized speech given the corresponding text. (b.1) Text-to-prosody token generator training with phoneme-level pooling. (b.2) TTScore-pro as the conditional likelihood of phoneme-level prosody tokens for a given text and synthesized speech.
  • Figure 2: Score distribution analysis for the proposed prosody metric