Towards Developing State-of-the-Art TTS Synthesisers for 13 Indian Languages with Signal Processing aided Alignments
Anusha Prakash, S Umesh, Hema A Murthy
TL;DR
This work addresses the challenge of building high-quality end-to-end TTS for 13 Indian languages under low-resource conditions by improving duration modelling through signal-processing-guided phone alignments. It introduces a Hybrid HMM-GD-DNN segmentation (HS) external aligner that fuses group-delay syllable boundaries and sub-band spectral flux with traditional alignments, integrated into a FastSpeech2 + HiFi-GAN pipeline. Across 25 TTS systems and 13 languages, HS generally matches or surpasses teacher-model, MFA, MAS, and VITS-based alignments in objective metrics and subjective evaluations, and it outperforms the current ai4bharat_TTS_2023 models in most cases. The findings demonstrate the value of combining signal-processing cues with data-driven training to improve pronunciation and duration accuracy in low-resource multilingual TTS, with potential applicability to other E2E systems and prosodic features.
Abstract
End-to-end (E2E) systems synthesise high-quality speech, but this typically requires a large amount of data. As E2E synthesis progressed from Tacotron to FastSpeech2, it became evident that features representing prosody, particularly sub-word durations, are important for error-free synthesis. Variants of FastSpeech use a teacher model or forced alignments for training. This paper uses signal processing cues in tandem with forced alignment to produce accurate phone boundaries for the training data. As a result of better duration modelling, good-quality synthesisers are developed. Evaluations indicate that systems developed using the proposed signal processing-aided approach are better than systems developed using other alignment approaches, especially in low-resource scenarios. Our systems also outperform the existing best TTS systems available for 13 Indian languages.
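To make the role of phone boundaries concrete, the following is a minimal sketch of how boundary times from any external aligner could be turned into the per-phone frame durations that a FastSpeech2-style duration predictor is trained on. It is not the authors' implementation: the function name, the sample rate of 22050 Hz, the hop length of 256 samples, and the example phones are all assumptions chosen for illustration.

```python
def boundaries_to_frame_durations(segments, sample_rate=22050, hop_length=256):
    """Convert phone segments given in seconds into per-phone frame counts.

    segments: list of (phone, start_sec, end_sec) tuples, assumed contiguous.
    sample_rate and hop_length are assumed values, not taken from the paper.
    """
    frames_per_sec = sample_rate / hop_length
    durations = []
    for phone, start_sec, end_sec in segments:
        # Round the boundaries rather than the durations, so that the frame
        # counts of contiguous phones sum to the rounded end of the utterance.
        start_frame = int(round(start_sec * frames_per_sec))
        end_frame = int(round(end_sec * frames_per_sec))
        durations.append((phone, end_frame - start_frame))
    return durations


# Hypothetical alignment for a three-phone word.
alignment = [("k", 0.00, 0.08), ("a", 0.08, 0.25), ("m", 0.25, 0.40)]
for phone, n_frames in boundaries_to_frame_durations(alignment):
    print(phone, n_frames)
```

Because each duration is a difference of rounded boundaries, a boundary that is off by a few tens of milliseconds shifts frames from one phone to its neighbour, which is one way to see why more accurate boundaries can translate into better duration modelling.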
