Towards Developing State-of-the-Art TTS Synthesisers for 13 Indian Languages with Signal Processing aided Alignments

Anusha Prakash, S Umesh, Hema A Murthy

TL;DR

This work addresses the challenge of building high-quality end-to-end TTS for 13 Indian languages under low-resource conditions by improving duration modelling through signal-processing-guided phone alignments. It introduces a Hybrid HMM-GD-DNN segmentation (HS) external aligner that fuses group-delay syllable boundaries and sub-band spectral flux with traditional alignments, integrated into a FastSpeech2 + HiFi-GAN pipeline. Across 25 TTS systems and 13 languages, HS generally matches or surpasses teacher-model, MFA, MAS, and VITS-based alignments in objective metrics and subjective perception, and outperforms the current ai4bharat_TTS_2023 models in most cases. The findings demonstrate the value of combining signal-processing cues with data-driven training to improve pronunciation and duration accuracy in low-resource multilingual TTS, with potential applicability to other E2E systems and prosodic features.

Abstract

End-to-end (E2E) systems synthesise high-quality speech, but this typically requires a large amount of data. As E2E synthesis progressed from Tacotron to FastSpeech2, it became evident that features representing prosody, particularly sub-word durations, are important for error-free synthesis. Variants of FastSpeech use a teacher model or forced alignments for training. This paper uses signal processing cues in tandem with forced alignment to produce accurate phone boundaries for the training data. As a result of better duration modelling, good-quality synthesisers are developed. Evaluations indicate that systems developed using the proposed signal processing-aided approach are better than systems developed using other alignment approaches, especially in low-resource scenarios. Our systems also outperform the existing best TTS systems available for 13 Indian languages.
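
To make the link between phone boundaries and duration modelling concrete, the sketch below converts time-stamped phone segments (as any external aligner might produce) into per-phone frame counts, which is the kind of target a FastSpeech2-style duration predictor is trained on. This is an illustration only, not the authors' pipeline: the function name, the phone labels, and the sample rate and hop length are all assumptions chosen for the example.

```python
from typing import List, Tuple


def boundaries_to_frame_durations(
    segments: List[Tuple[str, float, float]],
    sample_rate: int = 22050,  # assumed value, not taken from the paper
    hop_length: int = 256,     # assumed value, not taken from the paper
) -> List[Tuple[str, int]]:
    """Convert (phone, start_sec, end_sec) segments to (phone, n_frames)."""
    frames_per_sec = sample_rate / hop_length
    durations = []
    for phone, start, end in segments:
        if end < start:
            raise ValueError(f"segment for {phone!r} ends before it starts")
        # Round the boundaries rather than the segment lengths, so that the
        # per-phone durations of contiguous segments sum to the utterance
        # frame count and rounding errors do not accumulate.
        start_frame = int(round(start * frames_per_sec))
        end_frame = int(round(end * frames_per_sec))
        durations.append((phone, end_frame - start_frame))
    return durations


# Hypothetical alignment for a short utterance (labels are placeholders).
example = [("s", 0.00, 0.10), ("a", 0.10, 0.25), ("h", 0.25, 0.31)]
for phone, n_frames in boundaries_to_frame_durations(example, 16000, 160):
    print(phone, n_frames)
```

Because each duration is derived directly from the boundary positions, a boundary that is off by a few tens of milliseconds shifts frames from one phone to its neighbour, which is why more accurate alignments translate into better duration targets.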

Paper Structure

This paper contains 15 sections, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: An example of a Hindi waveform (bottom panel), its spectrogram (fourth panel), and phone-level alignments obtained from different techniques (top 3 panels). TS: teacher-student approach, MFA: Montreal forced aligner, HS: hybrid segmentation. The highlighted regions indicate the alignments in MFA and the correct alignments obtained using HS.
  • Figure 2: Spectrograms of synthesised utterances of Hindi male systems (with full data) using MFA (top) and HS (bottom) corresponding to the text "eek acchaa tariikaa".
  • Figure 3: Spectrograms of synthesised utterances of Hindi male systems (with 1 hour data) using MFA (top) and HS (bottom) corresponding to the text "sahi dxhang".