Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback

Chen Chen; Yuchen Hu; Wen Wu; Helin Wang; Eng Siong Chng; Chao Zhang

TL;DR

UNO tackles the mismatch between subjective human evaluations of speech quality and TTS training objectives. It introduces an uncertainty-aware optimization framework that directly maximizes the utility of zero-shot TTS outputs from human annotations, without a reward model. The method combines diverse sampling of speech prompts, binary desirability signals, and annotator uncertainty, and it defines a value function $V_{\text{TTS}}$ and a loss $\mathcal{L}_{\text{TTS}}$ that are stabilized by a reference term $Z_{\text{ref}}$. Empirically, UNO substantially improves word error rate (WER), speaker similarity (SIM), and estimated MOS; human evaluations corroborate these gains and show reduced variability. The method also extends to emotional TTS via valence/arousal prompts. This work provides a practical pathway to human-aligned, high-quality zero-shot TTS and suggests broader applicability to other AI-generated content (AIGC) tasks where subjective evaluation is noisy and expensive.
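
The summary above names the ingredients of the objective but not its exact form, so the following is only an illustrative sketch of how a binary desirability signal, an annotator-uncertainty score, and a reference term could be combined into a value function and a loss. Everything concrete here is an assumption for illustration and is not taken from the paper: the `Sample` container, the log-ratio reward, the `beta0 / (1 + uncertainty)` scaling, and the `1 - value` loss are hypothetical choices, and only the roles of $Z_{\text{ref}}$, $V_{\text{TTS}}$, and $\mathcal{L}_{\text{TTS}}$ come from the text.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    """One annotated synthetic utterance (hypothetical container)."""
    logp_policy: float   # log-likelihood under the model being trained
    logp_ref: float      # log-likelihood under a frozen reference model
    desirable: bool      # binary "like" / "dislike" annotation
    uncertainty: float   # non-negative annotator-disagreement score


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def value(sample: Sample, z_ref: float, beta0: float = 1.0) -> float:
    """Illustrative stand-in for V_TTS (assumed form, not the paper's)."""
    # Assumed implicit reward: log-ratio of policy to reference likelihood.
    reward = sample.logp_policy - sample.logp_ref
    # Assumed scaling: uncertain labels get a smaller effective weight.
    beta = beta0 / (1.0 + sample.uncertainty)
    # Desirable samples should score above the reference term,
    # undesirable samples below it.
    margin = reward - z_ref if sample.desirable else z_ref - reward
    return sigmoid(beta * margin)


def loss(batch: list[Sample], z_ref: float, beta0: float = 1.0) -> float:
    """Illustrative stand-in for L_TTS: mean of (1 - value) over the batch."""
    return sum(1.0 - value(s, z_ref, beta0) for s in batch) / len(batch)


if __name__ == "__main__":
    batch = [
        Sample(logp_policy=-1.0, logp_ref=-3.0, desirable=True, uncertainty=0.0),
        Sample(logp_policy=-1.0, logp_ref=-3.0, desirable=True, uncertainty=4.0),
        Sample(logp_policy=-1.0, logp_ref=-3.0, desirable=False, uncertainty=0.0),
    ]
    for s in batch:
        print(s.desirable, s.uncertainty, round(value(s, z_ref=0.0), 3))
    print("loss:", round(loss(batch, z_ref=0.0), 3))
```

In this sketch, a sample with high annotator uncertainty is pulled toward a value of 0.5 and therefore contributes a weaker training signal than a confidently labeled one, which is one plausible reading of "uncertainty-aware".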

Abstract

In recent years, text-to-speech (TTS) technology has advanced rapidly, particularly with large-scale training datasets, achieving human-level speech quality and strong zero-shot capabilities on unseen speakers. However, although subjective human evaluations such as the mean opinion score (MOS) remain the gold standard for assessing the quality of synthetic speech, even state-of-the-art TTS approaches keep human feedback isolated from training, which results in a mismatch between training objectives and evaluation metrics. In this work, we investigate the novel topic of integrating subjective human evaluation into the TTS training loop. Inspired by the recent success of reinforcement learning from human feedback, we propose a comprehensive sampling-annotating-learning framework tailored to TTS optimization, namely uncertainty-aware optimization (UNO). Specifically, UNO eliminates the need for a reward model or preference data by directly maximizing the utility of speech generations while accounting for the uncertainty that arises from the inherent variability in subjective human speech perception and evaluation. Experimental results from both subjective and objective evaluations demonstrate that UNO considerably improves the zero-shot performance of TTS models in terms of MOS, word error rate, and speaker similarity. Additionally, we show that UNO can adapt seamlessly and flexibly to a desired speaking style in emotional TTS.

Paper Structure (21 sections, 12 equations, 9 figures, 4 tables)

Introduction
Related Work
Background
Methodology
Data Sampling and Annotating
Uncertainty-aware Learning for TTS
Experiments Setup
Result and Analysis
Objective Results.
Human Evaluation.
Analysis on Uncertainty.
Extension on Emotional TTS.
Conclusion
Frequently Asked Questions
More Discussion on LLM and TTS Calibration
...and 6 more sections

Figures (9)

Figure 1: The sampling-annotating-learning framework of UNO. In annotating, the "like" and "dislike" symbols denote the binary signal for whether the synthetic speech is desirable or not, and the digits represent the uncertainty caused by the variability of annotators.
Figure 2: Visualization of UNO. The yellow-to-red arrow indicates the change before and after UNO. In the token-level visualization (upper part), each point is projected from the generated tokens, while in the utterance-level visualization (lower part), each point is projected from the embedding of an utterance. The data points circled in red are failed zero-shot TTS cases.
Figure 2: Results on human evaluation.
Figure 3: Comparison results of uncertainty and MOS. $u^2$ is estimated by I-CNF models.
Figure 4: WER and MOS Results on 830M models.
...and 4 more figures
