EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS

Haoxun Li, Yu Liu, Yuqing Sun, Hanlei Shi, Leyuan Qu, Taihao Li

TL;DR

EMORL-TTS tackles fine-grained emotion control in LLM-based TTS by modeling global emotional intensity in the Valence–Arousal–Dominance (VAD) space and coupling it with local prosodic emphasis. It combines supervised fine-tuning (SFT) with reinforcement learning through Group Relative Policy Optimization (GRPO), guided by three rewards: emotion-category accuracy, intensity fidelity, and emphasis controllability, while keeping the BiCodec decoder fixed. The two-stage training process (emotion-controllable SFT followed by GRPO) improves emotion accuracy, intensity differentiation, and emphasis clarity without sacrificing synthesis quality. This approach enables continuous, fine-grained emotion control in token-based LLM-TTS and opens avenues for cross-lingual and multimodal extensions.
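To make the GRPO step more concrete, here is a minimal sketch of how three per-sample rewards could be combined and turned into group-relative advantages. It is an illustration under stated assumptions, not the authors' implementation: the function names, the equal reward weights, the example reward values, and the use of population standard deviation are all hypothetical. Only the general GRPO idea of normalizing each sample's reward against the mean and standard deviation of its own sampled group is standard.

```python
from statistics import mean, pstdev


def combined_reward(r_emotion, r_intensity, r_emphasis, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three rewards (equal weights are an assumption)."""
    w_emo, w_int, w_emp = weights
    return w_emo * r_emotion + w_int * r_intensity + w_emp * r_emphasis


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, GRPO-style.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std is used here as an assumption; eps avoids division by zero
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (emotion, intensity, emphasis) rewards for a group of 4
# utterances sampled from the same prompt.
group = [
    (1.0, 0.8, 1.0),
    (1.0, 0.2, 0.0),
    (0.0, 0.5, 1.0),
    (0.0, 0.1, 0.0),
]
rewards = [combined_reward(*triple) for triple in group]
advantages = group_relative_advantages(rewards)

for triple, adv in zip(group, advantages):
    print(triple, round(adv, 3))
```

Samples that score above their group's average receive a positive advantage and are reinforced; those below receive a negative one. Because the baseline comes from the group itself, no separate value model is needed.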

Abstract

Recent LLM-based TTS systems achieve strong quality and zero-shot ability, but lack fine-grained emotional control due to their reliance on discrete speech tokens. Existing approaches either limit emotions to categorical labels or cannot generalize to LLM-based architectures. We propose EMORL-TTS (Fine-grained Emotion-controllable TTS with Reinforcement Learning), a framework that unifies global intensity control in the VAD space with local emphasis regulation. Our method combines supervised fine-tuning with reinforcement learning guided by task-specific rewards for emotion category, intensity, and emphasis. We further investigate how emphasis placement modulates fine-grained emotion intensity. Experiments show that EMORL-TTS improves emotion accuracy, intensity differentiation, and emphasis clarity, while preserving synthesis quality comparable to strong LLM-based baselines.

Paper Structure

This paper contains 9 sections, 9 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Overview of the proposed LLM-based fine-grained emotion-controllable TTS framework. Text, emotion, and intensity tokens are fed into the LLM, and the BiCodec decoder reconstructs the waveform. Reinforcement learning with multiple rewards (emotion classification, global emotion intensity, and local emphasis control) is employed to enhance controllability.
  • Figure 2: Aggregated emotion intensity scores across different parts of speech. Emphasis on adverbs and adjectives produces stronger perceived intensity compared to other categories.