Everyday Speech in the Indian Subcontinent
Utkarsh P
TL;DR
The paper tackles the challenge of synthesizing everyday speech in the Indian subcontinent, where code-mixing across 22 official languages is common. It proposes extending the Common Label Set (CLS) to form a superset that supports code-mixed/code-switched TTS via a unified parser, enabling zero-shot synthesis for Sanskrit and Konkani. Evaluations using MOS, AXY discrimination, and dialect considerations show the approach yields reasonable intelligibility and naturalness, with dialect-aware benefits and caveats about language-voice matching. The work demonstrates that native-accent, code-mixed synthesis is feasible without inflating the system footprint, paving the way for more accessible, multilingual TTS in highly diverse linguistic contexts.
Abstract
India has 1369 languages, of which 22 are official. About 13 different scripts are used to represent these languages. A phonetics-based Common Label Set (CLS) was developed to address the large vocabulary of units that the End-to-End (E2E) framework would otherwise require for multilingual synthesis. Indian language text is first converted to CLS. This approach enables seamless code switching across 13 Indian languages and English in a given native speaker's voice, which corresponds to everyday speech in the Indian subcontinent, where the population is multilingual.
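The idea of converting text in different scripts to one shared label set can be illustrated with a toy sketch. This is not the paper's parser or the actual CLS inventory: the table `TOY_LABELS`, the function `to_toy_labels`, and the label strings are hypothetical, and vowel signs, inherent vowels, and conjuncts are deliberately ignored.

```python
# Toy illustration only: a handful of consonants from two scripts mapped to
# shared phonetic labels. The labels are made up for this sketch and are NOT
# the real Common Label Set.
TOY_LABELS = {
    "\u0915": "k",  # DEVANAGARI LETTER KA
    "\u0B95": "k",  # TAMIL LETTER KA
    "\u092E": "m",  # DEVANAGARI LETTER MA
    "\u0BAE": "m",  # TAMIL LETTER MA
    "\u0932": "l",  # DEVANAGARI LETTER LA
    "\u0BB2": "l",  # TAMIL LETTER LA
}


def to_toy_labels(text):
    """Map each character to a shared label; unknown characters become '<unk>'."""
    return [TOY_LABELS.get(ch, "<unk>") for ch in text]


devanagari = "\u0915\u092E\u0932"  # KA MA LA in Devanagari
tamil = "\u0B95\u0BAE\u0BB2"       # KA MA LA in Tamil

# Both scripts collapse to the same label sequence, so a single model
# vocabulary can serve both.
print(to_toy_labels(devanagari))
print(to_toy_labels(tamil))
```

Because both inputs map to the same labels, the downstream model sees one small unit inventory instead of a separate character set per script, which is the motivation the abstract gives for CLS.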
