RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis

Haoxiang Shi, Jianzong Wang, Xulong Zhang, Ning Cheng, Jun Yu, Jing Xiao

TL;DR

RSET addresses the challenge of fine-grained emotion intensity control in TTS by introducing a remapping-based sorting method to model intra-class intensity and a mutual information–based decoupling framework to separate speaker and emotion representations. The approach couples an emotion intensity controller with an emotion embedding candidate pool and attention-based fusion, while enforcing speaker consistency and minimizing information leakage. Experimental results on the ESD dataset show that RSET achieves superior MOS, SMOS, and emotional accuracy with competitive spectral distortion compared to strong baselines, and ablations confirm the necessity of each component. The work advances practical emotion transfer TTS by enabling perceptible, controllable emotion intensity without sacrificing speaker identity, enhancing expressiveness for applications in education and voice assistants.

Abstract

Although current Text-To-Speech (TTS) models can generate high-quality speech samples, developing TTS with controllable emotion intensity remains challenging. Most existing TTS models achieve emotion intensity control by extracting intensity information from reference speech. Unfortunately, because they lack modeling of intra-class emotion intensity and have limited information decoupling capability, the generated speech cannot achieve fine-grained emotion intensity control and suffers from information leakage. In this paper, we propose an emotion transfer TTS model that defines a remapping-based sorting method to model intra-class relative intensity information, combines it with Mutual Information (MI) to decouple speaker and emotion information, and synthesizes expressive speech with perceptible intensity differences. Experiments show that our model achieves fine-grained emotion control while preserving speaker information.

Paper Structure (17 sections, 11 equations, 5 figures, 2 tables, 1 algorithm)

Figures (5)

  • Figure 1: The architecture of the RSET model. The upper part consists of three modules, from left to right: the information decoupling module, the emotion intensity control module, and the synthesizer module. The emotion intensity control module includes an emotion intensity controller and a fusion module. The lower part illustrates the overall process of the remapping sorting method. The final sample points represent speech features containing fine-grained intensity information, and they constitute the emotion embedding candidate pool for emotional speech synthesis.
  • Figure 2: Candidate pool and attention fusion module during inference.
  • Figure 3: Comparison curve of emotion sorting accuracy. The straight line in the center indicates the mean accuracy for each intensity level, with lighter hues on both sides representing the variance. Blue represents RSET, while orange corresponds to Mixed Emotion.
  • Figure 4: The confusion matrix for different emotion intensity results. The horizontal axis shows the actual intensity levels, and the vertical axis shows the manually sorted results.
  • Figure 5: A/B preference test for RSET and Mixed Emotion.
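
The Figure 2 caption mentions a candidate pool and an attention fusion module, but this page does not give the fusion formula. As a rough illustration only, the sketch below assumes plain scaled dot-product attention: a query vector scores each candidate embedding in the pool, and the softmax weights produce one fused embedding. The function name `attention_fuse`, the dimensions, and the random inputs are hypothetical and are not taken from the paper.

```python
import numpy as np


def attention_fuse(query, pool):
    """Fuse candidate embeddings with scaled dot-product attention.

    ASSUMPTION: generic attention, not the paper's confirmed fusion module.
    query: shape (d,), hypothetical intensity/emotion query vector.
    pool:  shape (n, d), hypothetical emotion embedding candidate pool.
    Returns (fused, weights) with shapes (d,) and (n,).
    """
    d = pool.shape[1]
    scores = pool @ query / np.sqrt(d)   # one similarity score per candidate
    scores = scores - scores.max()       # shift for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()    # softmax over the candidates
    fused = weights @ pool               # weighted average of candidates
    return fused, weights


# Illustrative inputs only: 5 candidates of dimension 8.
rng = np.random.default_rng(0)
pool = rng.normal(size=(5, 8))
query = rng.normal(size=8)
fused, weights = attention_fuse(query, pool)
```

Because the weights are non-negative and sum to one, the fused embedding is a convex combination of the pool entries, so a query that aligns strongly with one candidate pulls the result toward that candidate.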