Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation

Changjin Han, Seokgi Lee, Gyuhyeon Nam, Gyeongsu Chae

TL;DR

This work addresses mispronunciation in diffusion-based zero-shot TTS. It identifies that the diffusion process degrades phonetic information and proposes StableForm-TTS in response. The approach combines source-filter theory with a decomposed variance adaptor: diffusion is applied only to the excitation pathway, while formants are generated deterministically, supported by a style-aware linguistic encoder and modules built on style-adaptive layer normalization (SALN). Results on unseen speakers show improved pronunciation accuracy and naturalness with comparable speaker similarity, and the model scales effectively as data and model sizes increase. The combination of variance-based priors and explicit source-filter modeling offers a practical path to robust, diffusion-based zero-shot synthesis at scale.

Abstract

Diffusion models have achieved remarkable success in text-to-speech (TTS), even in zero-shot scenarios. Recent efforts aim to address the trade-off between inference speed and sound quality, often considered the primary drawback of diffusion models. However, we find that a critical mispronunciation issue is being overlooked. Our preliminary study reveals that the diffusion process itself leads to unstable pronunciation. Based on this observation, we introduce StableForm-TTS, a novel zero-shot speech synthesis framework designed to produce robust pronunciation while maintaining the advantages of diffusion modeling. By pioneering the adoption of source-filter theory in diffusion TTS, we propose an elaborate architecture for stable formant generation. Experimental results on unseen speakers show that our model outperforms the state-of-the-art method in pronunciation accuracy and naturalness, with comparable speaker similarity. Moreover, our model scales effectively as both data and model sizes increase. Audio samples are available online: https://deepbrainai-research.github.io/stableformtts/.

Paper Structure (20 sections, 6 equations, 3 figures, 3 tables)

This paper contains 20 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Line charts of the CER ratio ($\frac{CER_{n}}{CER_{0}}$, $n\in[0,100]$) against the number of reverse steps for each diffusion TTS model. The letters attached to model names denote S: single-speaker, M: multi-speaker, and ZS: zero-shot. Solver and training dataset names are abbreviated in parentheses.
  • Figure 2: Overall architecture of StableForm-TTS. For brevity, the phoneme, pitch, and energy embedding layers are omitted.
  • Figure 3: Visual comparison. The text beneath the mel-spectrogram indicates the ASR model's transcription of the area enclosed within the red dotted box.
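
The CER ratio plotted in Figure 1 can be illustrated with a minimal sketch. It assumes the character error rate (CER) is the character-level edit distance between a reference text and an ASR transcription, divided by the reference length, which is the usual definition. The caption does not state how each $CER_{n}$ is measured, so the function names and the step-to-CER mapping below are hypothetical and are not taken from the paper.

```python
from typing import Dict


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def cer_ratio(cer_by_step: Dict[int, float]) -> Dict[int, float]:
    """Normalize each CER_n by CER_0 (assumed layout: {step n: CER_n})."""
    base = cer_by_step[0]
    if base == 0:
        raise ValueError("CER_0 is zero, so the ratio is undefined")
    return {n: c / base for n, c in sorted(cer_by_step.items())}
```

Under this reading, a ratio above 1 at step n means the transcription error is higher than at step 0, and a ratio of exactly 1 at n = 0 holds by construction.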