DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance

Jinhyeok Yang; Junhyeok Lee; Hyeong-Seok Choi; Seunghun Ji; Hyeongju Kim; Juheon Lee

TL;DR

DualSpeech is a TTS model that integrates phoneme-level latent diffusion with dual classifier-free guidance, enabling exceptional control over speaker-fidelity and text-intelligibility.

Abstract

Text-to-Speech (TTS) models have advanced significantly, aiming to accurately replicate the diversity of human speech, including unique speaker identities and linguistic nuances. Despite these advances, achieving an optimal balance between speaker-fidelity and text-intelligibility remains a challenge, particularly when diverse control demands are considered. To address this, we introduce DualSpeech, a TTS model that integrates phoneme-level latent diffusion with dual classifier-free guidance. This approach enables exceptional control over speaker-fidelity and text-intelligibility. Experimental results demonstrate that, by utilizing this control, DualSpeech surpasses existing state-of-the-art TTS models in performance. Demos are available at https://bit.ly/48Ewoib.
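The abstract does not spell out how the two guidance terms are combined. As a rough illustration only, the sketch below uses a generic nested two-condition form of classifier-free guidance, in which one scale pushes the prediction toward the text condition and a second scale pushes it toward the speaker condition. The function name `dual_cfg`, the argument names, and the nesting order are assumptions made for this example and may differ from the formulation in the paper.

```python
import numpy as np

def dual_cfg(eps_uncond, eps_text, eps_text_spk, w_text, w_spk):
    """Combine three denoiser outputs with two guidance scales.

    Generic nested two-condition guidance (an assumption, not the
    paper's stated rule):
      eps_uncond   : prediction with both conditions dropped
      eps_text     : prediction with text only
      eps_text_spk : prediction with text and speaker prompt
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_text = np.asarray(eps_text, dtype=float)
    eps_text_spk = np.asarray(eps_text_spk, dtype=float)
    text_direction = eps_text - eps_uncond        # moves toward the text condition
    speaker_direction = eps_text_spk - eps_text   # moves toward the speaker condition
    return eps_uncond + w_text * text_direction + w_spk * speaker_direction

# Hypothetical stand-ins for the outputs of a latent diffusion denoiser.
rng = np.random.default_rng(0)
e_none, e_txt, e_full = (rng.normal(size=(4, 8)) for _ in range(3))
guided = dual_cfg(e_none, e_txt, e_full, w_text=2.0, w_spk=1.5)
```

With both scales set to 1 this form reduces to the fully conditioned prediction, and with both set to 0 it reduces to the unconditional one; raising either scale independently is what would give separate control over intelligibility and speaker similarity under these assumptions.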

Paper Structure (15 sections, 2 equations, 1 figure, 3 tables)

Introduction
Method
Phoneme-Level Variational Auto-Encoder
Phoneme-Level Latent Diffusion Model
Dual Classifier-Free Guidance for TTS
Inference
Experiments
Settings
Training
Dataset
Evaluation Metrics
Results
Subjective Evaluation
Objective Result
Conclusion

Figures (1)

Figure 1: Overall model architecture of DualSpeech. Trainable blocks are colored in yellow and pre-trained modules are colored in gray. All blocks are based on the Transformer encoder architecture, even if their architecture is not mentioned in the main text.
