Cross-Utterance Conditioned VAE for Speech Generation

Yang Li, Cheng Yu, Guangzhi Sun, Weiqin Zu, Zheng Tian, Ying Wen, Wei Pan, Chao Zhang, Jun Wang, Yang Yang, Fanglei Sun

TL;DR

The paper addresses two weaknesses of neural speech synthesis: limited prosodic expressiveness and the difficulty of seamless editing. It introduces the Cross-Utterance Conditioned Variational Autoencoder Speech Synthesis (CUC-VAE S2) framework, which brings in cross-utterance information through a Cross-Utterance (CU) embedding and an utterance-specific prior within a conditional VAE (CVAE), integrated with a FastSpeech 2 decoder and a HiFi-GAN vocoder. Two practical algorithms build on the framework: CUC-VAE TTS for context-aware text-to-speech generation and CUC-VAE SE for realistic speech editing. Evaluations on LibriTTS show improvements in prosody diversity, naturalness, and intelligibility, and strong fidelity in editing tasks. By combining context derived from pre-trained language models with CVAE-based prosody modeling, the work offers a principled approach to context-aware speech synthesis and editing that could benefit multimedia production and post-editing workflows.

Abstract

Speech synthesis systems powered by neural networks hold promise for multimedia production, but frequently face issues with producing expressive speech and seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). The core component of the CUC-VAE S2 framework is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing. The CUC-VAE TTS is a direct application of the framework, designed to generate audio with contextual prosody derived from surrounding texts. On the other hand, the CUC-VAE SE algorithm leverages real mel spectrogram sampling conditioned on contextual information, producing audio that closely mirrors real sound and thereby facilitating flexible speech editing based on text such as deletion, insertion, and replacement. Experimental results on the LibriTTS datasets demonstrate that our proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech.
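
To make the idea of an utterance-specific prior more concrete, the following is a minimal sketch of the two generic CVAE ingredients it relies on: reparameterized sampling from a diagonal Gaussian, and the KL divergence between a posterior and a learned, context-conditioned prior (instead of a fixed standard normal). This is not the authors' implementation. The function names, the two-dimensional latent, and all numeric values are hypothetical, and in a real system the means and log-variances would be produced by neural networks conditioned on the CU embedding and the acoustic features.

```python
import math
import random


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        # 0.5 * [log(var_p / var_q) + (var_q + (mu_q - mu_p)^2) / var_p - 1]
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one sample per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


# Hypothetical prior parameters, standing in for a prior predicted from
# cross-utterance context (assumption: values chosen only for illustration).
mu_prior, logvar_prior = [0.2, -0.1], [0.0, 0.0]

# Hypothetical posterior parameters, standing in for an encoder that also
# sees the reference acoustic features during training.
mu_post, logvar_post = [0.5, 0.3], [-1.0, -0.5]

rng = random.Random(0)
z_train = reparameterize(mu_post, logvar_post, rng)    # training: sample the posterior
z_infer = reparameterize(mu_prior, logvar_prior, rng)  # inference: sample the prior
kl = kl_diag_gaussians(mu_post, logvar_post, mu_prior, logvar_prior)
print("KL(posterior || prior) =", kl)
```

In this sketch the KL term pulls the posterior toward a prior that depends on context, so that sampling from the prior alone at inference time can still yield context-dependent latent prosody variables.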

Paper Structure (26 sections, 15 equations, 5 figures, 8 tables, 1 algorithm)

Figures (5)

  • Figure 1: An overview of the Cross-Utterance Conditioned VAE Speech Synthesis (CUC-VAE S2) Framework architecture. The primary CUC-VAE synthesizer utilizes textual information derived from neighboring text via the Cross-Utterance (CU) embedding, as well as audio information processed by the pre-processing module. A supplementary vocoder is incorporated with the purpose of transforming the synthesized mel-spectrogram into waveform.
  • Figure 2: A comprehensive overview of the practical CUC-VAE TTS algorithm.
  • Figure 3: A comprehensive overview of the practical CUC-VAE SE algorithm.
  • Figure 4: The mel-spectrograms of target speech and speech edited by EditSpeech, our system with unbiased training (loss ratio = 1:1), and our system with biased training (loss ratio = 1:1.5). The region marked with time (0.62s $\sim$ 1.15s) is the edited region.
  • Figure 5: Comparisons between the energy and pitch contours of the same text “Mary asked the time” with different neighbouring utterances, generated by the CUC-VAE TTS module.