Sample-Efficient Diffusion for Text-To-Speech Synthesis

Justin Lovelace, Soham Ray, Kwangyoun Kim, Kilian Q. Weinberger, Felix Wu

TL;DR

The paper tackles the large data requirements of state-of-the-art TTS systems with Sample-Efficient Speech Diffusion (SESD), a latent diffusion framework trained on EnCodec latent embeddings using less than 1k hours of speech. Its main contributions are the U-Audio Transformer (U-AT), which combines a 1D U-Net with a transformer backbone; position-aware cross-attention to ByT5 text representations; and an asymmetric diffusion loss weighting that emphasizes high noise levels to improve transcript alignment. SESD reaches a word error rate (WER) of 2.3% for text-only synthesis, near the human reference of 2.2%, and a WER of 2.3% with a speaker similarity of 0.617 for speaker-prompted synthesis. It does so while training on less than 2% of the data used by strong baselines such as VALL-E, which substantially reduces the annotation needed for high-quality TTS.
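Since the headline results are reported as WER, a minimal sketch of the standard metric may help: the word-level edit distance between a reference transcript and a transcript of the synthesized audio, divided by the number of reference words. The function name and the lowercase-and-split normalization are illustrative assumptions, not the paper's evaluation pipeline. The hypothesis transcript is assumed to come from an ASR model, which is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: simple lowercase + whitespace tokenization; real evaluation
    setups usually apply more careful text normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of four reference words.
example_wer = word_error_rate("a b c d", "a b d")
```

Under this definition, a WER of 2.3% corresponds to roughly one word-level error per 43 reference words.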

Abstract

This work introduces Sample-Efficient Speech Diffusion (SESD), an algorithm for effective speech synthesis in modest data regimes through latent diffusion. It is based on a novel diffusion architecture, which we call the U-Audio Transformer (U-AT), that efficiently scales to long sequences and operates in the latent space of a pre-trained audio autoencoder. Conditioned on character-aware language model representations, SESD achieves impressive results despite training on less than 1k hours of speech, far less than current state-of-the-art systems. In fact, it synthesizes more intelligible speech than the state-of-the-art auto-regressive model, VALL-E, while using less than 2% of the training data.

Paper Structure

This paper contains 7 sections, 5 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Overview of our Sample-Efficient Speech Diffusion architecture.
  • Figure 2: Diffusion loss weighting across noise levels. We allocate significant weight to higher levels of noise to improve transcript alignment.
  • Figure 3: Speaker-prompted performance across dataset sizes. We display the relative size of the training dataset for each method.
  • Figure 4: Ablation studies.
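
The idea behind Figure 2, giving more loss weight to higher noise levels, can be illustrated with a small sketch. This is not the paper's weighting formula, which is not given here. It is a hypothetical asymmetric weight over log-SNR that decays quickly (Gaussian) on the low-noise side and slowly (Cauchy-like) on the high-noise side. The function names, `center`, and both scale parameters are assumptions chosen for illustration.

```python
import math


def asymmetric_weight(log_snr: float,
                      center: float = 0.0,
                      low_noise_scale: float = 1.0,
                      high_noise_scale: float = 2.0) -> float:
    """Unnormalized loss weight as a function of log-SNR (hypothetical form).

    Low log-SNR means high noise. The weight peaks at `center` and keeps a
    heavier tail on the high-noise side.
    """
    d = log_snr - center
    if d >= 0:
        # Low-noise side: Gaussian falloff, weight vanishes quickly.
        return math.exp(-0.5 * (d / low_noise_scale) ** 2)
    # High-noise side: Cauchy-like falloff, weight decays slowly.
    return 1.0 / (1.0 + (d / high_noise_scale) ** 2)


def weighted_loss(per_example_losses, log_snrs) -> float:
    """Average of per-example losses, each scaled by its noise-level weight."""
    weights = [asymmetric_weight(s) for s in log_snrs]
    return sum(w * l for w, l in zip(weights, per_example_losses)) / len(weights)


# Same distance from the center, opposite sides: the high-noise example
# (log-SNR = -4) keeps far more weight than the low-noise one (log-SNR = +4).
high_noise_w = asymmetric_weight(-4.0)
low_noise_w = asymmetric_weight(4.0)
```

Both branches equal 1 at `center`, so the weight is continuous there. Only the tail behavior differs between the two sides, which is what makes the weighting asymmetric.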