Sample-Efficient Diffusion for Text-To-Speech Synthesis
Justin Lovelace, Soham Ray, Kwangyoun Kim, Kilian Q. Weinberger, Felix Wu
TL;DR
The paper targets the heavy data requirements of state-of-the-art TTS by introducing SESD, a latent-diffusion framework trained on EnCodec latent embeddings that synthesizes speech efficiently from minimal labeled data. Its central innovations are the U-Audio Transformer, which combines a 1D U-Net with a transformer backbone; position-aware cross-attention to ByT5-based text representations; and an asymmetric diffusion loss weighting that emphasizes high-noise regimes to improve transcript alignment. SESD attains a text-only WER of 2.3% (close to the human WER of 2.2%) and a speaker-prompted WER of 2.3% with a speaker similarity of 0.617, while using less than 2% of the data required by strong baselines such as VALL-E. It therefore substantially reduces the annotation needed for high-quality TTS in both text-only and speaker-conditioned settings.
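To make the idea of a loss weighting that emphasizes high-noise regimes concrete, here is a minimal sketch. It is not the paper's weighting function, which this summary does not specify. As a stand-in it uses a sigmoid weighting over the log signal-to-noise ratio, with a cosine noise schedule assumed for illustration. The names log_snr_cosine, high_noise_weight, weighted_mse and the bias parameter are all hypothetical.

```python
import math


def log_snr_cosine(t: float) -> float:
    """Log signal-to-noise ratio of an assumed cosine schedule, for t in (0, 1).

    With alpha_t = cos(pi*t/2) and sigma_t = sin(pi*t/2), the log-SNR is
    log(alpha_t^2 / sigma_t^2) = -2 * log(tan(pi*t/2)).
    """
    return -2.0 * math.log(math.tan(0.5 * math.pi * t))


def high_noise_weight(lam: float, bias: float = 0.0) -> float:
    """Stand-in weight sigmoid(bias - lam): larger when log-SNR is low (high noise).

    `bias` is a hypothetical knob that shifts where the weight crosses 0.5.
    """
    x = bias - lam
    # Numerically stable sigmoid for both signs of x.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def weighted_mse(pred, target, t: float, bias: float = 0.0) -> float:
    """Mean squared error for one sample, scaled by the noise-level weight."""
    mse = sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)
    return high_noise_weight(log_snr_cosine(t), bias) * mse
```

Under these assumptions, the same prediction error counts for more at t = 0.9 (mostly noise) than at t = 0.1 (mostly signal), so training is pushed toward the timesteps where the model must rely on the transcript rather than on residual signal in the noisy latent.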
Abstract
This work introduces Sample-Efficient Speech Diffusion (SESD), an algorithm for effective speech synthesis in modest data regimes through latent diffusion. It is based on a novel diffusion architecture, which we call the U-Audio Transformer (U-AT), that scales efficiently to long sequences and operates in the latent space of a pre-trained audio autoencoder. Conditioned on character-aware language model representations, SESD achieves impressive results despite training on less than 1k hours of speech, far less than current state-of-the-art systems. In fact, it synthesizes more intelligible speech than the state-of-the-art auto-regressive model, VALL-E, while using less than 2% of the training data.
