FlashSpeech: Efficient Zero-Shot Speech Synthesis

Zhen Ye; Zeqian Ju; Haohe Liu; Xu Tan; Jianyi Chen; Yiwen Lu; Peiwen Sun; Jiahao Pan; Weizhen Bian; Shulin He; Wei Xue; Qifeng Liu; Yike Guo

TL;DR

FlashSpeech is a large-scale zero-shot speech synthesis system that needs approximately 5% of the inference time of previous work. It is built on the latent consistency model and uses a novel adversarial consistency training approach that trains from scratch, without a pre-trained diffusion model as the teacher.

Abstract

Recent progress in large-scale zero-shot speech synthesis has been driven largely by language models and diffusion models. However, generation with both methods is slow and computationally intensive. Efficient speech synthesis that matches the quality of previous work on a lower computing budget remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system that needs approximately 5% of the inference time of previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. FlashSpeech generates speech efficiently in one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks like voice conversion, speech editing, and diverse speech sampling. Audio samples can be found at https://flashspeech.github.io/.
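
To make "one or two sampling steps" concrete, the following is a minimal sketch of the standard multistep sampling loop from the consistency-model literature, not FlashSpeech's actual code. The noise range (`SIGMA_MIN`, `SIGMA_MAX`), the intermediate time `0.8`, and `toy_consistency_fn` are assumptions chosen for illustration: the toy function is the exact consistency function for a dataset in which every sample equals `mu`, and it stands in for the trained network.

```python
import math
import random

SIGMA_MIN = 0.002  # smallest noise level (assumed value)
SIGMA_MAX = 80.0   # largest noise level (assumed value)


def toy_consistency_fn(x, t, mu=1.5):
    """Stand-in for a trained model: maps a noisy point at time t to the data.

    Exact only for the toy case where every data sample equals mu.
    It satisfies the boundary condition f(x, SIGMA_MIN) = x.
    """
    return [mu + (SIGMA_MIN / t) * (xi - mu) for xi in x]


def consistency_sample(f, dim, mid_times=(), rng=None):
    """One-step sampling if mid_times is empty, multistep otherwise."""
    rng = rng or random.Random(0)
    # Step 1: start from pure noise and jump straight to a data estimate.
    x_noisy = [SIGMA_MAX * rng.gauss(0.0, 1.0) for _ in range(dim)]
    x = f(x_noisy, SIGMA_MAX)
    # Optional extra steps: re-noise to a smaller time, then denoise again.
    for t in mid_times:
        scale = math.sqrt(t * t - SIGMA_MIN ** 2)
        x_noisy = [xi + scale * rng.gauss(0.0, 1.0) for xi in x]
        x = f(x_noisy, t)
    return x


one_step = consistency_sample(toy_consistency_fn, dim=4)
two_step = consistency_sample(toy_consistency_fn, dim=4, mid_times=(0.8,))
```

Each call to `f` corresponds to one network evaluation, so the one-step and two-step variants cost one and two evaluations respectively. In the real system the function would also take the phoneme, prompt, and prosody conditioning described in the paper.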

Paper Structure (37 sections, 18 equations, 5 figures, 6 tables)

Introduction
Related work
Large-Scale Speech Synthesis
Acceleration of Speech Synthesis
Consistency Model
FlashSpeech
Overview
Latent Consistency Model
Adversarial Consistency Training
Consistency Training
Adversarial Training
Combined Together
Prosody Generator
Analysis of Prosody Prediction
Prosody Refinement via Consistency Model
...and 22 more sections

Figures (5)

Figure 1: Inference-time comparison of different zero-shot speech synthesis systems, measured by real-time factor (RTF).
Figure 2: Overall architecture of FlashSpeech. FlashSpeech consists of a codec encoder/decoder and a latent consistency model conditioned on features from a phoneme and $\mathbf{z}_{prompt}$ encoder and from a prosody generator. A discriminator is used during training.
Figure 3: An illustration of adversarial consistency training.
Figure 4: An illustration of prosody generator.
Figure 5: User preference study. We compare the audio quality and speaker similarity of FlashSpeech against baselines, using samples from the baselines' official demos.
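
The real-time factor used in Figure 1 is conventionally the time spent synthesizing divided by the duration of the audio produced, so lower is faster. The sketch below shows how an RTF ratio relates to the "about 20 times faster" and "approximately 5% of the inference time" figures. The function names and the timing numbers are hypothetical and are not measurements from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(baseline_rtf: float, candidate_rtf: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_rtf / candidate_rtf


# Hypothetical timings for synthesizing the same 10-second utterance.
baseline_rtf = real_time_factor(synthesis_seconds=4.0, audio_seconds=10.0)
candidate_rtf = real_time_factor(synthesis_seconds=0.2, audio_seconds=10.0)

factor = speedup(baseline_rtf, candidate_rtf)
time_fraction = candidate_rtf / baseline_rtf  # share of the baseline's time

print(f"speedup: {factor:.1f}x, time fraction: {time_fraction:.1%}")
```

A 20x speedup and a 5% inference-time fraction are the same statement, since one is the reciprocal of the other.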
