Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models
Minki Kang, Dongchan Min, Sung Ju Hwang
TL;DR
Grad-StyleSpeech tackles zero-shot any-speaker adaptive TTS by conditioning a diffusion-based generator on a target speaker's style extracted from a brief reference audio. It integrates a Mel-Style encoder, a hierarchical transformer encoder, and a score-based diffusion model to produce mel-spectrograms that closely match the target voice. The approach outperforms recent baselines on LibriTTS and VCTK in both objective and subjective measures, with ablations highlighting the value of the hierarchical encoder and diffusion prior. The work enables practical, high-fidelity voice cloning with minimal reference data and provides accessible demo audio.
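As a rough illustration of how the three components named above could fit together, the following toy sketch uses NumPy only. It is not the authors' implementation. Every function name here is hypothetical: temporal mean pooling stands in for the Mel-Style encoder, an additive shift stands in for the hierarchical transformer encoder, and the sampler assumes a Grad-TTS-style probability-flow ODE with a linear noise schedule, which the summary above does not confirm. The score network is replaced by an exact "oracle" score for a single known target so that the example runs without training.

```python
import numpy as np

def style_vector(ref_mel):
    """Hypothetical stand-in for a style encoder: mean over reference frames."""
    return ref_mel.mean(axis=0)  # shape (n_mels,)

def text_prior(text_emb, style):
    """Hypothetical stand-in for a style-conditioned text encoder."""
    return text_emb + style  # style is broadcast over frames

def beta(t, b0=0.05, b1=20.0):
    """Assumed linear noise schedule on t in [0, 1]."""
    return b0 + (b1 - b0) * t

def beta_integral(t, b0=0.05, b1=20.0):
    """Integral of beta from 0 to t."""
    return b0 * t + 0.5 * (b1 - b0) * t ** 2

def oracle_score(x, mu, t, target):
    """Exact score when the data distribution is the single point `target`.

    Under the assumed forward process, x_t is Gaussian with the mean and
    variance computed below. A trained network would replace this function.
    """
    rho = np.exp(-0.5 * beta_integral(t))
    mean = rho * target + (1.0 - rho) * mu
    var = 1.0 - rho ** 2
    return -(x - mean) / var

def reverse_diffusion(mu, score_fn, steps=100, seed=0):
    """Euler integration of the probability-flow ODE from t=1 down to t=0."""
    rng = np.random.default_rng(seed)
    x = mu + rng.standard_normal(mu.shape)  # start from the prior N(mu, I)
    h = 1.0 / steps
    for k in range(steps, 0, -1):
        t = k * h
        drift = 0.5 * beta(t) * (mu - x - score_fn(x, mu, t))
        x = x - h * drift  # step backwards in time
    return x

# Toy data: 50 reference frames, 12 output frames, 8 mel bins (all made up).
rng = np.random.default_rng(1)
ref_mel = rng.standard_normal((50, 8))
text_emb = rng.standard_normal((12, 8))
target = rng.uniform(-1.0, 1.0, size=(12, 8))

mu = text_prior(text_emb, style_vector(ref_mel))
mel = reverse_diffusion(mu, lambda x, m, t: oracle_score(x, m, t, target))
```

In this sketch the sampler starts from noise centered on the style-conditioned prior `mu` and is pulled toward `target` by the score term. With a learned score network in place of `oracle_score`, the same loop would produce a mel-spectrogram for unseen text and an unseen speaker.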
Abstract
There has been significant progress in Text-To-Speech (TTS) synthesis in recent years, thanks to advances in neural generative modeling. However, existing methods for any-speaker adaptive TTS perform unsatisfactorily because they mimic target speakers' styles with suboptimal accuracy. In this work, we present Grad-StyleSpeech, an any-speaker adaptive TTS framework based on a diffusion model that generates highly natural speech with extremely high similarity to a target speaker's voice, given only a few seconds of reference speech. Grad-StyleSpeech significantly outperforms recent speaker-adaptive TTS baselines on English benchmarks. Audio samples are available at https://nardien.github.io/grad-stylespeech-demo.
