Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration
Haowei Lou, Helen Paik, Wen Hu, Lina Yao
TL;DR
This work addresses the dependency of text-to-speech systems on externally generated phoneme durations by introducing an Aligner-Guided Training Paradigm: an aligner is trained first to produce accurate duration labels $L$ from acoustic features, and these labels then guide TTS training. The method fuses an ASR-based aligner with a PDA to derive $L$ from a frame-level likelihood $C$, and trains a StyleSpeech-based TTS model using $X$, $L$, and a flexible target $Y$, with a duration adapter aligning embeddings to the predicted durations. In experiments on the Baker Chinese dataset, Mel-Spectrogram features perform best, improving overall WER by up to roughly 16% over MFA-based baselines (with substantial gains in WER-P and WER-S as well), while MFCCs and latent features offer progressively weaker gains. The results underscore the importance of accurate duration labelling for natural prosody and show that reducing reliance on external alignment tools can improve TTS naturalness and intelligibility in practical settings.
Abstract
Recent advancements in text-to-speech (TTS) systems, such as FastSpeech and StyleSpeech, have significantly improved speech generation quality. However, these models often rely on durations generated by external tools like the Montreal Forced Aligner, which can be time-consuming and inflexible. The importance of accurate durations is often underestimated, despite their crucial role in achieving natural prosody and intelligibility. To address these limitations, we propose a novel Aligner-Guided Training Paradigm that prioritizes accurate duration labelling by training an aligner before the TTS model. This approach reduces dependence on external tools and enhances alignment accuracy. We further explore the impact of different acoustic features, including Mel-Spectrograms, MFCCs, and latent features, on TTS model performance. Our experimental results show that aligner-guided duration labelling can achieve up to a 16% improvement in word error rate and significantly enhance phoneme and tone alignment. These findings highlight the effectiveness of our approach in optimizing TTS systems for more natural and intelligible speech generation.
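
To make the role of duration labels concrete, the following is a minimal Python sketch of how per-phoneme durations relate to frame-level labels and how phoneme-level embeddings can be stretched to frame length. It is an illustration under stated assumptions, not the method of this paper: the function names `durations_from_frames` and `expand_by_duration`, the toy phoneme sequence, and the one-dimensional embeddings are all hypothetical, and the run-length collapsing shown here stands in for whatever the trained aligner actually does.

```python
from itertools import groupby


def durations_from_frames(frame_labels):
    """Collapse per-frame phoneme labels into (phoneme, duration) pairs.

    Assumption: each frame already carries a single phoneme label, and
    adjacent identical labels belong to the same phoneme. Two consecutive
    occurrences of the same phoneme would be merged by this simple scheme.
    """
    return [(label, sum(1 for _ in run)) for label, run in groupby(frame_labels)]


def expand_by_duration(embeddings, durations):
    """Repeat each phoneme-level embedding for its number of frames."""
    if len(embeddings) != len(durations):
        raise ValueError("need exactly one duration per embedding")
    frames = []
    for emb, dur in zip(embeddings, durations):
        frames.extend([emb] * dur)
    return frames


# Toy example: 8 frames labelled with 4 phonemes (hypothetical data).
frame_labels = ["n", "n", "i", "i", "i", "h", "ao", "ao"]
pairs = durations_from_frames(frame_labels)
phonemes = [p for p, _ in pairs]
durations = [d for _, d in pairs]

# One toy embedding per phoneme, stretched to one embedding per frame.
embeddings = [[0.1], [0.2], [0.3], [0.4]]
frames = expand_by_duration(embeddings, durations)
```

In this sketch the durations always sum to the number of frames, which is the property that lets the expanded sequence line up with a frame-level acoustic target; an error in any single duration shifts every later phoneme, which is one way to see why label accuracy matters.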
