Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS Track

June Young Yi, Hyeongju Kim, Juheon Lee

TL;DR

The paper tackles robust TTS training in the presence of in-the-wild label noise and text-speech misalignment. It fine-tunes a lightweight flow-matching TTS model (Supertonic) with Self-Purifying Flow Matching, which routes unreliable text-speech pairs to unconditional training so that their acoustic information is still used, improving pronunciation fidelity while preserving acoustic quality. The approach achieves the lowest WER among WildSpoof participants and ranks second on perceptual metrics (UTMOS, DNSMOS), demonstrating that efficient, open-weight architectures can adapt to noisy real-world conditions when paired with explicit noise-handling strategies. This provides a practical, scalable path for deploying robust TTS in unconstrained environments.

Abstract

This paper presents a lightweight text-to-speech (TTS) system developed for the WildSpoof Challenge TTS Track. Our approach fine-tunes the recently released open-weight TTS model, Supertonic (https://github.com/supertone-inc/supertonic), with Self-Purifying Flow Matching (SPFM) to enable robust adaptation to in-the-wild speech. SPFM mitigates label noise by comparing conditional and unconditional flow matching losses on each sample, routing suspicious text-speech pairs to unconditional training while still leveraging their acoustic information. The resulting model achieves the lowest Word Error Rate (WER) among all participating teams, while ranking second in perceptual metrics such as UTMOS and DNSMOS. These findings demonstrate that efficient, open-weight architectures like Supertonic can be effectively adapted to diverse real-world speech conditions when combined with explicit noise-handling mechanisms such as SPFM.
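
The routing step described in the abstract can be illustrated with a minimal sketch. It assumes the per-sample conditional and unconditional flow matching losses have already been computed, and it assumes a simple decision rule (a pair is suspicious when its conditional loss exceeds its unconditional loss by more than a margin). The abstract does not state the exact criterion, so the rule, the `margin` parameter, and the names `Sample`, `route_samples`, and `training_loss` are all hypothetical and not taken from the paper or from Supertonic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    text: Optional[str]   # transcript used as the condition; None means unconditional
    cond_loss: float      # flow matching loss computed with the text condition
    uncond_loss: float    # flow matching loss computed with the condition dropped


def route_samples(batch: List[Sample], margin: float = 0.0) -> List[Sample]:
    """Drop the text condition for samples whose transcript looks unreliable.

    Assumed rule (not from the paper): a pair is suspicious when conditioning
    on its text gives a loss higher than the unconditional loss plus `margin`.
    """
    routed = []
    for s in batch:
        suspicious = s.cond_loss > s.uncond_loss + margin
        routed.append(Sample(None if suspicious else s.text, s.cond_loss, s.uncond_loss))
    return routed


def training_loss(batch: List[Sample]) -> float:
    """Average the loss each sample contributes after routing."""
    per_sample = [s.uncond_loss if s.text is None else s.cond_loss for s in batch]
    return sum(per_sample) / len(per_sample)


batch = [
    Sample("hello world", cond_loss=0.25, uncond_loss=0.75),     # text helps: keep it
    Sample("wrong transcript", cond_loss=1.0, uncond_loss=0.5),  # text hurts: drop it
]
routed = route_samples(batch)
```

In this sketch the second sample keeps contributing its audio to training through the unconditional loss, but its questionable transcript no longer supervises the text-conditioned path.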

Paper Structure

This paper contains 7 sections, 2 equations, 2 tables.