EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS
Haoxun Li, Yu Liu, Yuqing Sun, Hanlei Shi, Leyuan Qu, Taihao Li
TL;DR
EMORL-TTS tackles the challenge of fine-grained emotion control in LLM-based TTS by modeling global emotional intensity in the Valence–Arousal–Dominance (VAD) space and coupling it with local prosodic emphasis. It combines supervised fine-tuning (SFT) with reinforcement learning via Group Relative Policy Optimization (GRPO), guided by three rewards: emotion-category accuracy, intensity fidelity, and emphasis controllability, while keeping the BiCodec decoder fixed. The two-stage training process (emotion-controllable SFT followed by GRPO) improves emotion accuracy, intensity differentiation, and emphasis clarity without sacrificing synthesis quality. This approach enables continuous, fine-grained emotion control in token-based LLM-TTS and opens avenues for cross-lingual and multimodal extensions.
Abstract
Recent LLM-based TTS systems achieve strong quality and zero-shot ability, but lack fine-grained emotional control due to their reliance on discrete speech tokens. Existing approaches either limit emotions to categorical labels or cannot generalize to LLM-based architectures. We propose EMORL-TTS (Fine-grained Emotion-controllable TTS with Reinforcement Learning), a framework that unifies global intensity control in the Valence–Arousal–Dominance (VAD) space with local emphasis regulation. Our method combines supervised fine-tuning with reinforcement learning guided by task-specific rewards for emotion category, intensity, and emphasis. We further investigate how emphasis placement modulates fine-grained emotion intensity. Experiments show that EMORL-TTS improves emotion accuracy, intensity differentiation, and emphasis clarity, while preserving synthesis quality comparable to strong LLM-based baselines.
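To make the reward-guided step concrete, the following is a minimal sketch of how three task rewards could be merged into one scalar and turned into group-relative advantages, the normalization that Group Relative Policy Optimization is generally described as using. It is not the authors' implementation: the function names, the equal weights, and the reward values are hypothetical, and the summary above does not say how the three rewards are actually weighted or computed.

```python
import numpy as np

def combined_reward(r_emotion, r_intensity, r_emphasis, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three per-sample rewards.

    The equal default weights are an assumption for illustration only.
    """
    w = np.asarray(weights, dtype=float)
    # Shape (group_size, 3): one row per sampled utterance.
    r = np.stack([r_emotion, r_intensity, r_emphasis], axis=-1).astype(float)
    return r @ w

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for a group of 4 candidate utterances of one prompt.
r_emotion = [1.0, 0.0, 1.0, 1.0]    # e.g. emotion classifier agrees with target
r_intensity = [0.8, 0.2, 0.5, 0.9]  # e.g. closeness to the target intensity
r_emphasis = [1.0, 0.0, 0.0, 1.0]   # e.g. emphasis detected on the intended word

total = combined_reward(r_emotion, r_intensity, r_emphasis)
advantages = group_relative_advantages(total)
print(advantages)
```

Samples that score above the group average receive positive advantages and are reinforced, while below-average samples are discouraged; no separate value network is needed because the baseline comes from the group itself.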
