Phir Hera Fairy: An English Fairytaler is a Strong Faker of Fluent Speech in Low-Resource Indian Languages
Praveen Srinivasa Varadhan, Srija Anand, Soma Siddhartha, Mitesh M. Khapra
TL;DR
This work investigates adapting a high-capacity English TTS model (F5-TTS) to 11 low-resource Indian languages by comparing three strategies: training from scratch, fine-tuning the English-pretrained model on Indian data, and fine-tuning on mixed Indian and English data. It introduces a 685-token Indian-script character vocabulary to enable pure character-based modeling, and demonstrates that fine-tuning on Indian data alone yields the strongest multilingual synthesis, while English pretraining provides a robust prior that benefits data-scarce scenarios. The study reveals emergent capabilities (polyglot speech, cross-language transfer, code-mixed intelligibility, and expressive synthesis) along with scalable zero-resource TTS through synthetic data and human-in-the-loop refinement, achieving competitive MUSHRA scores. The results set a new state of the art for Indian languages, outperforming prior baselines and enabling practical TTS expansion to underrepresented languages under constrained data and compute.
Abstract
What happens when an English Fairytaler is fine-tuned on Indian languages? We evaluate how the English F5-TTS model adapts to 11 Indian languages, measuring polyglot fluency, voice-cloning, style-cloning, and code-mixing. We compare: (i) training from scratch, (ii) fine-tuning English F5 on Indian data, and (iii) fine-tuning on both Indian and English data to prevent forgetting. Fine-tuning with only Indian data proves most effective, and the resultant IN-F5 is a near-human polyglot that enables speakers of one language (e.g., Odia) to fluently speak in another (e.g., Hindi). Our results show that English pretraining helps low-resource TTS reach human parity. To aid progress in other low-resource languages, we study data-constrained setups and arrive at a compute-optimal strategy. Finally, we show that IN-F5 can synthesize unseen languages such as Bhojpuri and Tulu using a human-in-the-loop approach for zero-resource TTS via synthetic data generation.
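To make the idea of pure character-based modeling from the TL;DR concrete, the sketch below shows one plausible way to build a character-level vocabulary from transcripts and encode text with it. It is not the authors' implementation: the function names `build_char_vocab` and `encode`, the `<pad>` token at id 0, and the NFC normalization step are all assumptions made for illustration.

```python
import unicodedata


def build_char_vocab(transcripts):
    """Map every unique character in the transcripts to an integer id.

    Hypothetical helper: the <pad> token and NFC normalization are
    illustrative assumptions, not details taken from the paper.
    """
    chars = set()
    for text in transcripts:
        # Normalize so canonically equivalent Unicode sequences share ids.
        chars.update(unicodedata.normalize("NFC", text))
    vocab = {"<pad>": 0}
    # Sort for a deterministic id assignment across runs.
    for ch in sorted(chars):
        vocab[ch] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert text to a list of character ids (raises KeyError on unseen characters)."""
    return [vocab[ch] for ch in unicodedata.normalize("NFC", text)]


# Usage: build a shared vocabulary over transcripts in several scripts.
vocab = build_char_vocab(["नमस्ते", "ନମସ୍କାର", "hello"])
ids = encode("नमस्ते", vocab)
```

Because every script contributes its own characters to one shared table, a single model can consume text in any covered language without a language-specific phonemizer, which is the property a character-level vocabulary is meant to provide.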
