Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization

Yuchen Hu; Chen Chen; Siyin Wang; Eng Siong Chng; Chao Zhang

TL;DR

This work tackles robustness gaps in zero-shot autoregressive TTS by introducing Reverse Inference Optimization (RIO), a reinforcement-learning-from-human-feedback approach that requires no human annotations. A Bayesian reverse-inference criterion selects production-perception-consistent exemplars by enforcing agreement between forward generation $P(\mathbf{Y}|\mathbf{T}_{\mathrm{Y}},\mathbf{T}_{\mathrm{X}},\mathbf{X})$ and reverse inference $P(\mathbf{X}|\mathbf{T}_{\mathrm{X}},\mathbf{T}_{\mathrm{Y}},\hat{\mathbf{Y}})$, shaping a sampling-annotating-learning loop that avoids reward models and pairwise preferences. The method uses MOS-estimated quality labeling to form positive/negative pools and optimizes via a KL-stabilized implicit reward, achieving substantial improvements in WER, MOS, and speaker similarity, while reducing bad outputs to near-zero levels. Importantly, results scale well to larger backbones (e.g., VoiceCraft-830M), bringing synthesized speech closer to ground-truth quality and demonstrating practical viability for robust, deployment-ready zero-shot TTS systems.

Abstract

In this paper, we propose reverse inference optimization (RIO), a simple and effective method designed to enhance the robustness of autoregressive-model-based zero-shot text-to-speech (TTS) systems using reinforcement learning from human feedback (RLHF). To assess the quality of speech produced by the TTS system without human annotations, RIO introduces a novel concept termed reverse inference, based on the Bayesian principle, which suggests that high-quality generated speech should itself be usable as a prompt for subsequent generation with the same TTS model. By using reverse inference as the criterion for selecting RLHF exemplars from the speech samples generated by the TTS system itself, RIO steers the subsequent optimization towards improved TTS robustness. The RIO framework, comprising sampling, automatic annotating, and learning, obviates the need for a reward model or pairwise preference data, and significantly improves the stability of zero-shot TTS performance by reducing the discrepancies between training and inference conditions. Our experimental results verify that RIO effectively improves both subjective and objective metrics, including mean opinion scores, word error rates, and speaker similarity. Remarkably, RIO also reduces the incidence of bad outputs to nearly zero percent, rivalling the robustness obtained when ground-truth speech is used as the prompt.
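The sampling and automatic-annotating stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: `synthesize`, `estimate_mos`, the `Example` container, and the rule "positive only if both the forward and the reverse output exceed the MOS threshold" are all assumptions made for illustration. The default threshold of 3.8 is borrowed from the Figure 2 caption, and its use as a labeling cutoff here is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical callables (assumptions, not the paper's API):
#   synthesize(prompt_speech, prompt_text, target_text) -> generated speech
#   estimate_mos(speech) -> automatic MOS estimate as a float
Synthesize = Callable[[object, str, str], object]
EstimateMos = Callable[[object], float]


@dataclass
class Example:
    prompt_speech: object  # X
    prompt_text: str       # T_X
    target_text: str       # T_Y


def annotate(
    examples: List[Example],
    synthesize: Synthesize,
    estimate_mos: EstimateMos,
    n_samples: int = 4,
    threshold: float = 3.8,  # assumed cutoff, see lead-in
) -> Tuple[list, list]:
    """Split self-generated samples into positive and negative pools."""
    positives, negatives = [], []
    for ex in examples:
        for _ in range(n_samples):
            # Forward generation: Y_hat ~ P(Y | T_Y, T_X, X)
            y_hat = synthesize(ex.prompt_speech, ex.prompt_text, ex.target_text)
            # Reverse inference: X_hat ~ P(X | T_X, T_Y, Y_hat),
            # i.e. the generated speech is reused as the prompt.
            x_hat = synthesize(y_hat, ex.target_text, ex.prompt_text)
            forward_ok = estimate_mos(y_hat) > threshold
            reverse_ok = estimate_mos(x_hat) > threshold
            # Assumed labeling rule: positive only if both directions pass.
            if forward_ok and reverse_ok:
                positives.append((ex, y_hat))
            else:
                negatives.append((ex, y_hat))
    return positives, negatives
```

The two pools would then feed the learning stage, which per the abstract needs neither a reward model nor pairwise preference data. That stage is not sketched here because its objective is not specified in this section.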

Paper Structure (14 sections, 6 equations, 5 figures, 3 tables)

Introduction
Related Work
Methodology
Problem Formulation of Zero-shot TTS
Reverse Inference
Optimization without Pairwise Preference Data
Experimental Setup
Results and Analysis
Objective Results
Human Evaluation
Scalability to Larger Backbone Models
Analysis of Zero-shot TTS Robustness
More Discussions and Future Work
Conclusion

Figures (5)

Figure 1: The overview of RIO. (a) Zero-shot TTS: Codec language model generates the synthesized speech conditioned on a 3-second speech prompt and text (including both text prompt and transcription), where the synthesized speech could be high-quality but not necessarily perceptually consistent with its speech prompt. (b) Reverse inference: The synthesized speech is sent back to the TTS model to predict the original prompt speech, whose quality can reflect the production-perception consistency (PPC) of previously synthesized speech. We then set those PPC samples as positive exemplars in RLHF to optimize the TTS model towards better robustness.
Figure 2: MOS distributions of $\hat{\mathbf{Y}}$ synthesized by different models using zero-shot generations with MOS > 3.8. The circle/cross denotes good/bad reverse inference results.
Figure 3: Results of the A/B test. "VC" and "GT" denote VoiceCraft and ground truth, respectively.
Figure 4: MOS and WER results on 830M models.
Figure 5: MOS score distributions of VoiceCraft-330M baseline and our proposed RIO approach.
