Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens

Ismail Rasim Ulgen; Zongyang Du; Junchen Lu; Philipp Koehn; Berrak Sisman

TL;DR

This work addresses the need for objective, interpretable evaluation of speech synthesis by separating intelligibility and prosody into targeted metrics. It introduces TTScore, a pair of conditional-likelihood metrics, TTScore-int and TTScore-pro, computed by text-conditioned seq2seq predictors over discrete speech tokens. Content tokens derived from HuBERT features by clustering measure intelligibility, while phoneme-level FACodec prosody tokens measure prosody without ground-truth references. Across the SOMOS, VoiceMOS, and TTSArena benchmarks, TTScore-int and TTScore-pro align more closely with human judgments of overall quality than traditional WER/CER and F0-based metrics, demonstrating robust, reference-free, speech-domain evaluation. The framework enables fine-grained diagnostics and holds promise for extending targeted evaluation to additional speech attributes.

Abstract

Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measure of intelligibility, while F0-RMSE and related pitch-based metrics offer a narrow, reference-dependent view of prosody. To address these limitations, we propose TTScore, a targeted and reference-free evaluation framework based on conditional prediction of discrete speech tokens. TTScore employs two sequence-to-sequence predictors conditioned on input text: TTScore-int, which measures intelligibility through content tokens, and TTScore-pro, which evaluates prosody through prosody tokens. For each synthesized utterance, the predictors compute the likelihood of the corresponding token sequences, yielding interpretable scores that capture alignment with intended linguistic content and prosodic structure. Experiments on the SOMOS, VoiceMOS, and TTSArena benchmarks demonstrate that TTScore-int and TTScore-pro provide reliable, aspect-specific evaluation and achieve stronger correlations with human judgments of overall quality than existing intelligibility and prosody-focused metrics.
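
To make the scoring idea concrete, the following is a minimal sketch of how a conditional-likelihood score over discrete tokens could be computed. It assumes a text-conditioned predictor has already produced one vector of unnormalized logits per token position; the function name `sequence_log_likelihood`, the use of natural-log probabilities, and the averaging over sequence length are illustrative assumptions, not details confirmed by the abstract.

```python
import math


def sequence_log_likelihood(step_logits, token_ids):
    """Average per-token log-probability of an observed token sequence.

    step_logits: list of lists; step_logits[t][k] is the (hypothetical)
        predictor's unnormalized score for token k at position t, already
        conditioned on the input text and the preceding tokens.
    token_ids:   list of ints; the tokens extracted from the synthesized
        utterance (content tokens or prosody tokens).

    Length normalization is an assumption made here so that utterances of
    different durations are comparable.
    """
    if not token_ids or len(step_logits) != len(token_ids):
        raise ValueError("need one logit vector per observed token")

    total = 0.0
    for logits, tok in zip(step_logits, token_ids):
        # Numerically stable log-softmax: subtract the max before exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += logits[tok] - log_z
    return total / len(token_ids)


# Toy usage: a 3-token vocabulary and a 2-token utterance.
likely = sequence_log_likelihood([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]], [0, 1])
unlikely = sequence_log_likelihood([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]], [2, 2])
```

In this toy case `likely` is greater than `unlikely`, because the first token sequence matches what the predictor expects given the text. Under the same assumptions, one such score would be computed with content tokens and another with prosody tokens.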
