MunTTS: A Text-to-Speech System for Mundari
Varun Gumma, Rishav Hada, Aditya Yadavalli, Pamir Gogoi, Ishani Mondal, Vivek Seshadri, Kalika Bali
TL;DR
MunTTS addresses the scarcity of TTS resources for Mundari by creating a sizeable, studio-quality speech corpus and evaluating end-to-end models on it. The study compares VITS variants, a finetuned XTTS v2, and a zero-shot baseline, and finds that VITS-44K delivers the best subjective and objective quality among the tested configurations (mean opinion score, MOS, ≈ 3.69; lowest mel-cepstral distortion, MCD). The work demonstrates the feasibility of high-quality, multi-speaker TTS for an extremely low-resource language and releases models to support language preservation and digital inclusion. It also discusses data collection, evaluation, and ethical considerations, highlighting practical challenges and future directions for scalable, community-driven language technologies in India.
Abstract
We present MunTTS, an end-to-end text-to-speech (TTS) system for Mundari, a low-resource Indian language of the Austro-Asiatic family. Our work addresses the gap in language technology for underrepresented languages by collecting and processing data to build a speech synthesis system. We begin by gathering a substantial dataset of Mundari text and speech, on which we train end-to-end speech models. We also describe the methods used to train our models so that they remain efficient and effective despite the data constraints. We evaluate our system with native speakers and with objective metrics, demonstrating its potential as a tool for preserving and promoting the Mundari language in the digital age.
