E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, Naoyuki Kanda
TL;DR
E2 TTS demonstrates that a fully non-autoregressive zero-shot TTS system can achieve human-like naturalness and strong speaker similarity without a duration model or grapheme-to-phoneme tooling. The text input is represented as a character sequence augmented with filler tokens, and a flow-matching-based mel-spectrogram generator is trained on a speech infilling task. The paper reports strong objective and subjective results against Voicebox, VALL-E, and NaturalSpeech 3, and introduces two practical extensions (X1 and X2) that remove the transcription requirement and enable pronunciation control. It also reports that the approach scales well to larger training datasets. Overall, E2 TTS offers a compact, extensible framework that rivals more complex autoregressive and diffusion-based zero-shot TTS models while allowing flexible input representations and user controls.
Abstract
This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. The flow-matching-based mel spectrogram generator is then trained on the audio infilling task. Unlike many previous works, it does not require additional components (e.g., duration model, grapheme-to-phoneme) or complex techniques (e.g., monotonic alignment search). Despite its simplicity, E2 TTS achieves state-of-the-art zero-shot TTS capabilities that are comparable to or surpass previous works, including Voicebox and NaturalSpeech 3. The simplicity of E2 TTS also allows for flexibility in the input representation. We propose several variants of E2 TTS to improve usability during inference. See https://aka.ms/e2tts/ for demo samples.
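As an illustration of the input representation described in the abstract, the following minimal Python sketch turns a text string into a character sequence and appends filler tokens. The abstract does not say how many filler tokens are added, so padding up to a target frame count (`num_frames`) is an assumption made here for illustration, as are the filler symbol `<F>` and the function name `pad_with_filler`.

```python
from typing import List

# Hypothetical filler symbol; the actual token used by E2 TTS is not
# specified in the abstract.
FILLER = "<F>"


def pad_with_filler(text: str, num_frames: int, filler: str = FILLER) -> List[str]:
    """Split text into characters and append filler tokens.

    Assumption: the sequence is padded to `num_frames`, standing in for the
    length of the target mel-spectrogram, so that no duration model or
    explicit alignment is needed to match text length to audio length.
    """
    chars = list(text)
    if len(chars) > num_frames:
        raise ValueError("text has more characters than target frames")
    return chars + [filler] * (num_frames - len(chars))


# Example: 8 characters padded to a 12-frame target.
tokens = pad_with_filler("hi there", 12)
print(tokens)
```

Under this assumption, the generator receives a text sequence of the same length as the audio it must produce, and learning where each character falls in time is left to the model itself.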
