RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis
Haoxiang Shi, Jianzong Wang, Xulong Zhang, Ning Cheng, Jun Yu, Jing Xiao
TL;DR
RSET addresses the challenge of fine-grained emotion intensity control in TTS by introducing a remapping-based sorting method to model intra-class intensity and a mutual information–based decoupling framework to separate speaker and emotion representations. The approach couples an emotion intensity controller with an emotion embedding candidate pool and attention-based fusion, while enforcing speaker consistency and minimizing information leakage. Experimental results on the ESD dataset show that RSET achieves superior MOS, SMOS, and emotional accuracy with competitive spectral distortion compared to strong baselines, and ablations confirm the necessity of each component. The work advances practical emotion transfer TTS by enabling perceptible, controllable emotion intensity without sacrificing speaker identity, enhancing expressiveness for applications in education and voice assistants.
Abstract
Although current Text-to-Speech (TTS) models can generate high-quality speech, developing TTS with controllable emotion intensity remains challenging. Most existing TTS models control emotion intensity by extracting intensity information from reference speech. However, because they do not model intra-class emotion intensity and have limited ability to decouple information, the generated speech lacks fine-grained emotion intensity control and suffers from information leakage. In this paper, we propose an emotion transfer TTS model that defines a remapping-based sorting method to model intra-class relative intensity information, uses Mutual Information (MI) to decouple speaker and emotion information, and synthesizes expressive speech with perceptible intensity differences. Experiments show that our model achieves fine-grained emotion control while preserving speaker information.
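
For readers unfamiliar with MI as a decoupling criterion, the standard definition is sketched below. The symbols $s$ (speaker representation) and $e$ (emotion representation) are hypothetical names introduced here for illustration; the abstract does not specify the notation or the MI estimator used in the model.

```latex
% Standard definition of mutual information between two random variables.
% s and e are illustrative names for the speaker and emotion representations.
\begin{equation}
  I(s; e)
  = \mathbb{E}_{p(s, e)}\!\left[ \log \frac{p(s, e)}{p(s)\, p(e)} \right]
  = D_{\mathrm{KL}}\!\left( p(s, e) \,\middle\|\, p(s)\, p(e) \right) \;\ge\; 0 .
\end{equation}
% I(s; e) = 0 if and only if s and e are statistically independent,
% so driving an estimate of I(s; e) toward zero discourages either
% representation from carrying information about the other.
```

Under this reading, "information leakage" corresponds to a nonzero $I(s; e)$, which is why minimizing MI is a natural objective for separating the two representations.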
