Zero-Shot TTS With Enhanced Audio Prompts: BSC Submission For The 2026 WildSpoof Challenge TTS Track
Jose Giraldo, Alex Peiró-Lilja, Rodolfo Zevallos, Cristina España-Bonet
TL;DR
This work tackles zero-shot TTS for spontaneous, in-the-wild speech by combining a Sidon-based speech enhancement pipeline with two non-autoregressive TTS models (StyleTTS2 and F5-TTS) that use flexible duration modeling. The models are finetuned on the TITW Easy partition, and the reference audio prompts are analyzed systematically. Sidon-enhanced data reaches up to $4.21$ UTMOS and $3.47$ DNSMOS, and prompt quality and length strongly affect speaker similarity and intelligibility (WER). The best-performing setup uses F5-TTS with enhanced prompts, underscoring the importance of prompt design for zero-shot synthesis in realistic, noisy conditions. The results indicate that zero-shot TTS is viable on in-the-wild data and point to adaptive audio enhancement as a direction for balancing audio quality against speaker identity.
Abstract
We evaluate two non-autoregressive architectures, StyleTTS2 and F5-TTS, to address the spontaneous nature of in-the-wild speech. Both models use flexible duration modeling to improve prosodic naturalness. To handle acoustic noise, we implement a multi-stage enhancement pipeline based on the Sidon model, which significantly outperforms standard Demucs in signal quality. Experimental results show that finetuning on enhanced audio yields superior robustness, achieving up to 4.21 UTMOS and 3.47 DNSMOS. Furthermore, we analyze the impact of reference prompt quality and length on zero-shot synthesis performance, demonstrating the effectiveness of our approach for realistic speech generation.
