DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech

Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee

TL;DR

DiEmo-TTS tackles cross-speaker emotion transfer by disentangling emotion from speaker identity through a self-supervised distillation framework that integrates cluster-driven sampling, information perturbation, and emotion-cluster matching. The method introduces an emotion-disentangled DINO with enhanced clustering and a cosine loss, plus a dual conditioning transformer for robust style fusion within a FastSpeech 2–based TTS pipeline. Key contributions include a cluster-informed emotion representation, a formant-based information perturbation that distorts speaker traits, and style-adaptive conditioning that preserves emotional expressiveness while maintaining the target timbre. Experimental results on emotional speech datasets show state-of-the-art performance in naturalness, speaker similarity, and emotion expressiveness. Ablations confirm the importance of each component, and the paper notes its reliance on pseudo-labels for emotion dimensions.

Abstract

Cross-speaker emotion transfer in speech synthesis relies on extracting speaker-independent emotion embeddings for accurate emotion modeling without retaining speaker traits. However, existing timbre compression methods fail to fully separate speaker and emotion characteristics, causing speaker leakage and degraded synthesis quality. To address this, we propose DiEmo-TTS, a self-supervised distillation method that minimizes emotional information loss and preserves speaker identity. We introduce cluster-driven sampling and information perturbation to preserve emotion while removing irrelevant factors. To facilitate this process, we propose an emotion clustering and matching approach using emotional attribute prediction and speaker embeddings, enabling generalization to unlabeled data. Additionally, we design a dual conditioning transformer to better integrate style features. Experimental results confirm the effectiveness of our method in learning speaker-irrelevant emotion embeddings.
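
To make the idea of self-supervised distillation more concrete, the following is a minimal sketch of a generic DINO-style objective combined with a cosine term. It is not the authors' implementation: the function names, temperatures, momentum value, and the way the two losses would be weighted are all assumptions chosen for illustration, and the inputs are plain Python lists rather than real network outputs.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def distillation_loss(student_logits, teacher_logits, center,
                      t_student=0.1, t_teacher=0.04):
    """Cross-entropy between a centered, sharpened teacher distribution
    and the student distribution (generic DINO-style loss; the
    temperatures are illustrative assumptions)."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(v) for v in log_softmax(centered, t_teacher)]
    student_log_probs = log_softmax(student_logits, t_student)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))

def cosine_loss(a, b):
    """1 - cosine similarity: 0 when the embeddings point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights follow the student as an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

In a setup like this, the student would typically see a perturbed view of an utterance while the teacher sees another view that shares the same emotion, so that minimizing the loss encourages embeddings that agree on emotion but not on the perturbed factors. How DiEmo-TTS selects and perturbs those views is described in the paper itself, not in this sketch.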

Paper Structure

This paper contains 16 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overall framework of DiEmo-TTS.
  • Figure 2: Comparison of our proposed method in terms of speaker and emotion similarity.
  • Figure 3: t-SNE of speaker embeddings in the ESD dataset.
  • Figure 4: Visualization of the relative positional information of emotional clusters, represented by azimuth and elevation angles, for each speaker in spherical coordinates. The numbers in the plot represent individual speakers.
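
The azimuth and elevation angles mentioned in the Figure 4 caption can be illustrated with a short sketch. It assumes that each emotion cluster has already been reduced to a 3-D center and that angles are measured relative to a per-speaker reference point; both the reduction to 3-D and the choice of reference are assumptions made here, and the function names are hypothetical.

```python
import math

def to_azimuth_elevation(x, y, z):
    """Convert a 3-D direction to (azimuth, elevation) in degrees.

    Azimuth is the angle in the x-y plane measured from the x-axis;
    elevation is the angle above the x-y plane.
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("the zero vector has no direction")
    azimuth = math.degrees(math.atan2(y, x))
    # Clamp to guard against tiny floating-point overshoot before asin.
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, z / r))))
    return azimuth, elevation

def relative_direction(cluster_center, reference_center):
    """Direction of an emotion cluster as seen from a reference point
    (the reference is a hypothetical choice, e.g. one speaker's own
    reference cluster)."""
    offset = [c - r for c, r in zip(cluster_center, reference_center)]
    return to_azimuth_elevation(*offset)
```

If the same emotion yields similar angles across speakers under a representation like this, the relative layout of emotion clusters is consistent from speaker to speaker, which is the kind of pattern such a plot is meant to expose.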