Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models

Minki Kang, Dongchan Min, Sung Ju Hwang

TL;DR

Grad-StyleSpeech tackles zero-shot any-speaker adaptive TTS by conditioning a diffusion-based generator on a target speaker's style, extracted from a brief reference audio clip. It integrates a Mel-Style encoder, a hierarchical transformer encoder, and a score-based diffusion model to produce mel-spectrograms that closely match the target voice. The approach outperforms recent baselines on LibriTTS and VCTK in both objective and subjective measures, with ablations highlighting the value of the hierarchical encoder and diffusion prior. The work enables high-fidelity voice cloning from only a few seconds of reference speech, and demo audio is available online.

Abstract

There has been significant progress in Text-To-Speech (TTS) synthesis technology in recent years, thanks to advances in neural generative modeling. However, existing methods for any-speaker adaptive TTS have achieved unsatisfactory performance, due to their suboptimal accuracy in mimicking the target speakers' styles. In this work, we present Grad-StyleSpeech, an any-speaker adaptive TTS framework based on a diffusion model that can generate highly natural speech with extremely high similarity to a target speaker's voice, given a few seconds of reference speech. Grad-StyleSpeech significantly outperforms recent speaker-adaptive TTS baselines on English benchmarks. Audio samples are available at https://nardien.github.io/grad-stylespeech-demo.

Paper Structure (16 sections, 6 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Framework Overview. The blue box indicates the hierarchical transformer encoder, which outputs $\bm{\mu}$.
  • Figure 2: Qualitative Visualization. We visualize the mel-spectrograms synthesized by our model (a) before the diffusion model and (b) after the diffusion model. We use the same durations as the (c) ground-truth speech.
  • Figure 3: Objective evaluation for few-shot fine-tuning. We plot SECS and CER against the number of fine-tuning steps on the VCTK dataset. SAE and Diff indicate that we fine-tune the parameters of the Style-Adaptive Encoder and the diffusion model, respectively.
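
The two objective metrics plotted in Figure 3 can be illustrated with a small sketch. It assumes that SECS is the cosine similarity between speaker embeddings of the synthesized and reference speech, and that CER is the character-level edit distance between a recognized transcript and the reference text, divided by the reference length. The function names and plain-list inputs are hypothetical; in practice the embeddings would come from a speaker verification model and the hypothesis from a speech recognizer, neither of which is shown here or specified by this summary.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors (SECS-style score)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def character_error_rate(reference, hypothesis):
    """Character-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference transcript must be non-empty")
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            cost = 0 if ref_ch == hyp_ch else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1] / len(reference)
```

Under these assumptions, a higher SECS means the synthesized voice is closer to the target speaker, and a lower CER means the synthesized speech is more intelligible.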