RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS

Cong Wang, Changfeng Gao, Yang Xiang, Zhihao Du, Keyu An, Han Zhao, Qian Chen, Xiangang Li, Yingming Gao, Ya Li

TL;DR

RRPO addresses reward hacking in differentiable reward optimization for LLM-based emotional TTS by introducing a robust reward model learned via hybrid regularization. The approach combines Label Smoothing, Energy-Adaptive Mixup, and Adversarial Training to produce a reward signal that aligns with human perception, guiding the policy toward genuine emotional prosody. Experiments show that RRPO outperforms baselines in subjective expressiveness and naturalness, and that the reward model generalizes well across languages on speech emotion recognition (SER) tasks. This work enables more reliable RL-based emotional TTS and offers a pathway to applying robust reward signals to other speech attributes and larger pre-training regimes.
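
Of the three regularizers named above, label smoothing is the simplest to illustrate. The following is a minimal sketch of the standard formulation (mixing a one-hot target with a uniform distribution), not necessarily the exact variant used in RRPO; the function names, the four-class emotion setup, and the value `epsilon=0.1` are assumptions chosen for illustration.

```python
import math


def smooth_labels(one_hot, epsilon=0.1):
    """Standard label smoothing: blend a one-hot target with a uniform distribution.

    `epsilon` is a hypothetical smoothing strength, not a value from the paper.
    """
    num_classes = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / num_classes for y in one_hot]


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


# Hypothetical 4-class emotion target, with the second class as the true label.
hard_target = [0.0, 1.0, 0.0, 0.0]
soft_target = smooth_labels(hard_target, epsilon=0.1)

# Two hypothetical classifier outputs: one extremely confident, one moderately so.
overconfident = [0.001, 0.997, 0.001, 0.001]
moderate = [0.03, 0.91, 0.03, 0.03]

loss_overconfident = cross_entropy(soft_target, overconfident)
loss_moderate = cross_entropy(soft_target, moderate)
```

With the smoothed target, the extremely confident prediction incurs the higher loss, so a classifier trained this way is discouraged from producing saturated scores. That is one plausible reason such a regularizer could make a reward signal harder to exploit.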

Abstract

Differentiable reinforcement learning (RL) frameworks like DiffRO offer a powerful approach for controllable text-to-speech (TTS), but are vulnerable to reward hacking, particularly for nuanced tasks like emotion control. The policy model can exploit a vanilla Reward Model (RM) by generating acoustic artifacts that earn spurious rewards at the cost of degraded perceptual quality. To address this, we propose Robust Reward Policy Optimization (RRPO), a novel framework that employs a hybrid regularization scheme. This scheme develops a robust RM whose reward signal is more reliably aligned with human perception, compelling the policy to abandon detrimental shortcuts and instead learn the complex features of genuine emotions. Our ablation study confirms the enhanced robustness of our RM, as evidenced by its strong cross-lingual generalization. The subjective evaluation demonstrates that this robust RM effectively mitigates reward hacking, leading to significant improvements in both emotional expressiveness and naturalness over all baselines. Demo page: https://lrwinr.github.io/RRPO-CosyVoice.

Paper Structure

This paper contains 18 sections, 6 equations, 1 figure, 2 tables, 1 algorithm.

Figures (1)

  • Figure 1: The framework of our proposed Robust Reward Policy Optimization (RRPO).