
E1 TTS: Simple and Fast Non-Autoregressive TTS

Zhijun Liu, Shuai Wang, Pengcheng Zhu, Mengxiao Bi, Haizhou Li

TL;DR

E1 TTS is an efficient non-autoregressive zero-shot text-to-speech system based on denoising diffusion pretraining and distribution matching distillation that achieves naturalness and speaker similarity comparable to various strong baseline models.

Abstract

This paper introduces Easy One-Step Text-to-Speech (E1 TTS), an efficient non-autoregressive zero-shot text-to-speech system based on denoising diffusion pretraining and distribution matching distillation. The training of E1 TTS is straightforward; it does not require explicit monotonic alignment between the text and audio pairs. The inference of E1 TTS is efficient, requiring only one neural network evaluation for each utterance. Despite its sampling efficiency, E1 TTS achieves naturalness and speaker similarity comparable to various strong baseline models. Audio samples are available at http://e1tts.github.io/ .

Paper Structure (18 sections, 6 equations, 5 figures, 4 tables, 1 algorithm)

Figures (5)

  • Figure 1: Overview of distribution matching distillation (DMD) of diffusion models. The pretrained score estimator initializes both the one-step generator and the score estimator for the generated samples. After initialization, the generator is optimized with DMD in a manner analogous to adversarial training.
  • Figure 2: An overview of the E1 TTS inference pipeline in prompted text-to-speech: (1) The reference speech Mel spectrogram is encoded into speech tokens. (2) A Diffusion Transformer (DiT) generates all speech tokens given the prompt speech tokens and the prompt and target text. (3) Another DiT model generates the Mel spectrogram given the generated speech tokens. (4) A neural vocoder converts the input Mel spectrogram to the target waveform.
  • Figure 3: Illustration of the Text-to-Token Diffusion Transformer performing text-based speech editing. The model takes concatenated text and noised speech tokens as input, and predicts the masked speech tokens for the replaced text by predicting the score function. The model implicitly aligns text and speech modalities without token-to-token alignment information.
  • Figure 4: Token position indices in the Text-to-Token DiT.
  • Figure 5: WER and SECS of zero-shot text-to-speech with E1 TTS when scaling the predicted total duration by different factors.