E1 TTS: Simple and Fast Non-Autoregressive TTS

Zhijun Liu; Shuai Wang; Pengcheng Zhu; Mengxiao Bi; Haizhou Li

TL;DR

E1 TTS is an efficient non-autoregressive zero-shot text-to-speech system based on denoising diffusion pretraining and distribution matching distillation that achieves naturalness and speaker similarity comparable to various strong baseline models.

Abstract

This paper introduces Easy One-Step Text-to-Speech (E1 TTS), an efficient non-autoregressive zero-shot text-to-speech system based on denoising diffusion pretraining and distribution matching distillation. Training E1 TTS is straightforward: it does not require explicit monotonic alignment between text and audio pairs. Inference is efficient, requiring only one neural network evaluation per utterance. Despite its sampling efficiency, E1 TTS achieves naturalness and speaker similarity comparable to various strong baseline models. Audio samples are available at http://e1tts.github.io/.
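To make the "one neural network evaluation" claim concrete, the toy sketch below contrasts a multi-step sampler with a one-step generator by counting network calls. Everything here is an illustrative assumption and not the paper's implementation: `CountingNet`, `multi_step_sample`, and `one_step_sample` are hypothetical names, and the "network" is a hand-written toy velocity field rather than a trained model.

```python
class CountingNet:
    """Hypothetical stand-in for a neural network; counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, t):
        self.calls += 1
        # Toy velocity field pulling every coordinate toward 1.0 (not a trained model).
        return [1.0 - xi for xi in x]


def multi_step_sample(net, noise, num_steps):
    """Euler integration of dx/dt = net(x, t) from t=0 to t=1: one call per step."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = net(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def one_step_sample(net, noise):
    """A one-step generator maps noise to a sample with a single network call."""
    v = net(noise, 0.0)
    return [xi + vi for xi, vi in zip(noise, v)]


iterative_net = CountingNet()
multi_step_sample(iterative_net, [0.0, 0.5], num_steps=32)

distilled_net = CountingNet()
one_step_sample(distilled_net, [0.0, 0.5])

print(iterative_net.calls, distilled_net.calls)
```

In this toy setup the iterative sampler costs one call per step, while the one-step path costs a single call regardless of how many steps the original sampler would have used.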

Paper Structure (18 sections, 6 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Background
Distribution Matching Distillation
Rectified Flow
E1 TTS
The Mel Spectrogram Autoencoder
Text-to-Token Diffusion Transformer
Duration Modeling
Inference
Experiments and Results
Setup
Training Details
Zero-Shot Text-to-Speech
Text-based Speech Inpainting
Robustness to Different Speech Rate
...and 3 more sections

Figures (5)

Figure 1: Distribution matching distillation (DMD) of diffusion models is summarized in this overview. The pretrained score estimator serves to initialize both the one-step generator and the score estimator for the generated samples. Following initialization, the generator is optimized using DMD in a manner analogous to adversarial training.
Figure 2: An overview of the E1 TTS inference pipeline in prompted text-to-speech: (1) The reference speech Mel spectrogram is encoded into speech tokens. (2) A Diffusion Transformer (DiT) generates all speech tokens given the prompt speech tokens and the prompt and target text. (3) Another DiT model generates the Mel spectrogram given the generated speech tokens. (4) A neural vocoder converts the input Mel spectrogram to the target waveform.
Figure 3: Illustration of the Text-to-Token Diffusion Transformer performing text-based speech editing. The model takes concatenated text and noised speech tokens as input, and predicts the masked speech tokens for the replaced text by predicting the score function. The model implicitly aligns text and speech modalities without token-to-token alignment information.
Figure 4: Token position indices in the Text-to-Token DiT.
Figure 5: WER and SECS of zero-shot text-to-speech with E1 TTS when scaling the predicted total duration by different factors.
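
The four inference stages listed in the Figure 2 caption can be sketched as a simple composition of callables. This is a minimal sketch under stated assumptions: `E1Pipeline`, its field names, and the argument order of each stage are hypothetical, and each stage is passed in as an opaque function because the page does not specify the real model interfaces.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class E1Pipeline:
    """Hypothetical wrapper chaining the four stages described in Figure 2."""

    encode_mel: Callable[[Any], Any]            # (1) reference Mel spectrogram -> prompt speech tokens
    text_to_token: Callable[[Any, str, str], Any]  # (2) prompt tokens + prompt/target text -> speech tokens
    token_to_mel: Callable[[Any], Any]          # (3) speech tokens -> Mel spectrogram
    vocoder: Callable[[Any], Any]               # (4) Mel spectrogram -> waveform

    def synthesize(self, prompt_mel: Any, prompt_text: str, target_text: str) -> Any:
        prompt_tokens = self.encode_mel(prompt_mel)
        speech_tokens = self.text_to_token(prompt_tokens, prompt_text, target_text)
        mel = self.token_to_mel(speech_tokens)
        return self.vocoder(mel)
```

Keeping the stages as injected callables means any of them can be swapped for a stub during testing without touching the pipeline itself.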
