Zero-Shot TTS With Enhanced Audio Prompts: BSC Submission For The 2026 WildSpoof Challenge TTS Track

Jose Giraldo; Alex Peiró-Lilja; Rodolfo Zevallos; Cristina España-Bonet

TL;DR

This work tackles zero-shot TTS for spontaneous, in-the-wild speech by combining a Sidon-based speech enhancement pipeline with two non-autoregressive TTS models, StyleTTS2 and F5-TTS, both of which use flexible duration modeling. The models are fine-tuned on the TITW Easy partition, and the effect of the reference audio prompt is analyzed systematically. Fine-tuning on Sidon-enhanced data reaches up to 4.21 UTMOS and 3.47 DNSMOS, and prompt quality and length strongly affect speaker similarity and intelligibility (WER). The best-performing setup uses F5-TTS with enhanced prompts, which underscores the importance of prompt design for zero-shot synthesis in realistic, noisy conditions. The results indicate that zero-shot TTS is practical on in-the-wild data and point to adaptive audio enhancement as a way to balance audio quality against speaker identity.

Abstract

We evaluate two non-autoregressive architectures, StyleTTS2 and F5-TTS, to address the spontaneous nature of in-the-wild speech. Both models use flexible duration modeling to improve prosodic naturalness. To handle acoustic noise, we implement a multi-stage enhancement pipeline based on the Sidon model, which significantly outperforms standard Demucs in signal quality. Experimental results show that fine-tuning on enhanced audio yields superior robustness, achieving up to 4.21 UTMOS and 3.47 DNSMOS. Furthermore, we analyze the impact of reference prompt quality and length on zero-shot synthesis performance, demonstrating the effectiveness of our approach for realistic speech generation.
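Intelligibility, reported above as WER, is the word-level edit distance between a reference transcript and the transcript an ASR system produces for the synthesized audio, divided by the number of reference words. The following is a minimal sketch of that computation, not the challenge's official scoring script. The function name `word_error_rate` and the lowercase, whitespace-based tokenization are assumptions made for illustration; the actual evaluation may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Assumption: text is lowercased and split on whitespace. Real evaluation
    pipelines usually apply heavier normalization (punctuation, numerals).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

In this sketch, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion over six reference words. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.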

Paper Structure

This paper contains 5 sections, 1 figure, and 2 tables.

Introduction
Experiments
Results
Conclusion and future work
Acknowledgements

Figures (1)

Figure 1: Spectrogram comparison of models with different input audio prompts
