No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS

Seungyoun Shin; Dongha Ahn; Jiwoo Kim; Sungwook Jeon

TL;DR

This work addresses the lack of a verifiable automatic reward for prosody in text-to-speech (TTS). It shows that Group Relative Policy Optimization (GRPO) trained on transcription metrics tends to collapse prosody into monotone speech even as intelligibility improves. The authors propose iterative Direct Preference Optimization (DPO), which uses small batches of human preference labels to optimize prosodic naturalness directly while regularizing to the current model, and they evaluate it on KoCC-TTS. DPO achieves the highest human preference (ELO) with competitive character error rate (CER), outperforming GRPO and commercial baselines. The findings suggest a data-efficient, human-in-the-loop path to natural and robust TTS when prosodic quality cannot be reliably rewarded automatically. The authors also release KoCC-TTS as a benchmark for future work on prosody evaluation in Korean call-center styles.

Abstract

Recent work reports gains in neural text-to-speech (TTS) with Group Relative Policy Optimization (GRPO). However, in the absence of a verifiable reward for prosody, GRPO trained on transcription-oriented signals (CER/NLL) lowers error rates yet collapses prosody into monotone, unnatural speech; adding speaker-similarity further destabilizes training and degrades CER. We address this with an iterative Direct Preference Optimization (DPO) scheme that uses only a few hundred human-labeled preference pairs per round to directly optimize prosodic naturalness while regularizing to the current model. On KoCC-TTS, a curated dataset of authentic Korean call center interactions capturing task-oriented dialogues, our method attains the highest human preference (ELO) with competitive CER, outperforming GRPO and strong commercial baselines. These results suggest that when prosody cannot be rewarded automatically, human preference optimization offers a practical and data-efficient path to natural and robust TTS. The demo page is available at https://tts.ch.dev.
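As a rough illustration of the preference objective named above, the following minimal sketch computes the standard DPO loss for a single human-labeled pair. It is not taken from the paper: the function name `dpo_pair_loss`, the default `beta=0.1`, and the use of summed sequence log-probabilities as inputs are all assumptions made for this example. In an iterative setup such as the one the abstract describes, the reference log-probabilities would presumably come from the current model at the start of each round.

```python
import math


def dpo_pair_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Inputs are sequence log-probabilities of the preferred ("chosen") and
    dispreferred ("rejected") samples under the trainable policy and under
    a frozen reference model. beta=0.1 is an assumed value, not the paper's.
    """
    # Implicit reward of each sample: how much the policy has moved
    # its log-probability relative to the reference model.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(logits)) = log(1 + exp(-logits)),
    # evaluated in a numerically stable way for either sign.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy equals the reference, both margins are zero and the loss is log 2; it falls below that value as the policy shifts probability toward the sample that listeners preferred. The reference terms act as the regularizer: the loss rewards only the relative change against the reference model, not raw likelihood.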
