PAVITS: Exploring Prosody-aware VITS for End-to-End Emotional Voice Conversion

Tianhua Qi; Wenming Zheng; Cheng Lu; Yuan Zong; Hailun Lian

Abstract

In this paper, we propose Prosody-aware VITS (PAVITS) for emotional voice conversion (EVC), aiming to achieve the two major objectives of EVC: high content naturalness and high emotional naturalness, both of which are crucial for meeting the demands of human perception. To improve the content naturalness of converted audio, we develop an end-to-end EVC architecture inspired by the high audio quality of VITS. By integrating the acoustic converter and the vocoder into a single model, we address the mismatch between emotional prosody training and run-time conversion that is prevalent in existing EVC models. To further enhance emotional naturalness, we introduce an emotion descriptor that models the subtle prosody variations of different speech emotions. Additionally, we propose a prosody predictor, which predicts prosody features from text based on the provided emotion label. Notably, we introduce a prosody alignment loss that connects the latent prosody features from the two distinct modalities, ensuring effective training. Experimental results show that PAVITS outperforms state-of-the-art EVC methods. Speech samples are available at https://jeremychee4.github.io/pavits4EVC/.

Paper Structure (14 sections, 10 equations, 3 figures, 3 tables)

Introduction
Proposed method
Textual prosody prediction module
Acoustic prosody modeling module
Information alignment module
Emotional speech synthesis module
Final loss
Run-time conversion
Experiments
Dataset
Experimental Setup
Results & Discussion
Ablation Study
Conclusion

Figures (3)

Figure 1: Architecture of PAVITS.
Figure 2: Emotional similarity test with 95% confidence interval following zhou2021limited.
Figure 3: Spectrograms of a test clip (happy). From top to bottom: ground truth, speech converted by the original VITS, and speech converted by the proposed PAVITS.
